big data processing: big challenges and opportunities
TRANSCRIPT
April 10, 2013 11:17
Journal of Interconnection NetworksVol. 13, Nos. 3 & 4 (2012) 1250009 (19 pages)c© World Scientific Publishing CompanyDOI: 10.1142/S0219265912500090
BIG DATA PROCESSING: BIG CHALLENGES
AND OPPORTUNITIES∗
CHANGQING JI
College of Information Science and Technology,
Dalian Maritime University, 116026, China
College of Physical Science and Technology,
Dalian University, Dalian 116622, China
YU LI
School of Computer Science and Technology,
Dalian University of Technology, Dalian 116024, China
WENMING QIU
School of Computer Science and Technology,
Dalian University of Technology, Dalian 116024, China
Wenming [email protected]
YINGWEI JIN
School of Management,
Dalian University of Technology, Dalian 116024, China
YUJIE XU
College of Information Science and Technology,
Dalian Maritime University, Dalian 116026, China
UCHECHUKWU AWADA
School of Computer Science and Technology,
Dalian University of Technology, Dalian 116024, China
KEQIU LI
School of Computer Science and Technology,
Dalian University of Technology, Dalian 116024, China
∗A preliminary version of this paper was presented at the International Symposium on PervasiveSystems, Algorithms and Networks (I-SPAN’ 2012) in San Marcos, Texas, December 13–15, 2012.
1250009-1
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
WENYU QU
College of Information Science and Technology,
Dalian Maritime University, Dalian 116026, China
Received 20 December 2012Revised 20 January 2013
With the rapid growth of emerging applications like social network, semantic web, sensornetworks and LBS (Location Based Service) applications, a variety of data to be processedcontinues to witness a quick increase. Effective management and processing of large-scaledata poses an interesting but critical challenge. Recently, big data has attracted a lot ofattention from academia, industry as well as government. This paper introduces severalbig data processing techniques from system and application aspects. First, from the viewof cloud data management and big data processing mechanisms, we present the key issuesof big data processing, including definition of big data, big data management platform,big data service models, distributed file system, data storage, data virtualization platformand distributed applications. Following the MapReduce parallel processing framework, weintroduce some MapReduce optimization strategies reported in the literature. Finally, wediscuss the open issues and challenges, and deeply explore the research directions in thefuture on big data processing in cloud computing environments.
Keywords: Big data; cloud computing; data management; distributed processing.
1. Introduction
In the last two decades, the continuous increase of computational power has pro-
duced an overwhelming flow of data. Big data is not only becoming more avail-
able but also more understandable to computers. For example, modern high-energy
physics experiments, such as DZeroa, typically generate more than one terabyte
of data per day. The famous social network website, Facebook, serves 570 billion
pageviews, stores 3 billion new photos, and manages 25 billion pieces of contentb
montly. Google’s search and ad business, Facebook, Flickr, YouTube, and Linkedin
use a bundle of artificial-intelligence tricks. These require parsing vast quantities of
data and making decisions instantaneously. Google Earth uses 70.5 TB: 70 TB for
the raw imagery and 500 GB for the index data in 2006c. Multimedia data min-
ing platforms make it easy for everybody to achieve these goals with the minimum
amount of effort in terms of software, CPU and network. As of September 2012,
social video campaign “Kony 2012” was the fastest viral video to reach and surpass
100 million views. Recently, viral music video hit “Gangnam Style” by South Korean
artist “PSY” hit 100 million YouTube views after being online for only 52 days. On
March 29, 2012, American government announced the “Big Data Research and De-
velopment Initiative”, and big data becomes the national policy for the first timed.
ahttp://www-d0.fnal.govbhttp://www.facebook.comchttp://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.htmldhttp://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal
1250009-2
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
All these examples show that “THE ERA OF BIG DATA IS HERE” and daunting
big data challenges and significant resources were allocated to support these data
intensive operations which lead to high storage and data processing costs.
Big data and cloud computing are both the fastest-moving technologies identi-
fied in Gartner Inc.’s 2012 Hype Cycle for Emerging Technologiese. Gartner also
believes that big data is one of the tipping point technologies. Cloud computing
is associated with new paradigm for the provision of computing infrastructure and
big data processing method for all kinds of resources. Moreover, some new cloud-
based technologies have to be adopted because dealing with big data for concurrent
processing is difficult. Especially the use and processing of big data are becoming
a key basis of competition and growth for enterprises and scientific institutions.
Big data related technologies mainly include cloud computing, advanced machine
learning and intelligent data processing. The current technologies such as grid and
cloud computing have all intended to access large amounts of computing power by
aggregating resources and offering a single system view. Among these technologies,
cloud computing is becoming a powerful architecture to perform large-scale and
complex computing, and has revolutionized the way that computing infrastructure
is abstracted and used. In addition, an important aim of these technologies is to de-
liver computing as a solution for tackling big data, such as large-scale, multi-media
and high dimensional data sets.
Then what is Big Data? Viewed the words from the historical perspective, “very
large database” represents the level of GB, “Massive” represents the level of TB,
and “big data” means the level of TB or more. In the publication of the journal of
Science 2008, “Big Data” is defined as “Represents the progress of the human cog-
nitive processes, usually includes data sets with sizes beyond the ability of current
technology, method and theory to capture, manage, and process the data within a
tolerable elapsed time”.1 Recently, the definition of big data as also given by the
Gartner: “Big Data are high-volume, high-velocity, and/or high-variety information
assets that require new forms of processing to enable enhanced decision making,
insight discovery and process optimization”.2 According to Wikimedia, “In infor-
mation technology, big data is a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools”.f
The goal of this paper is to expand our I-SPAN 2012 paper3 and provide the
status of big data researches and related works, which aims at providing a general
view of big data management technologies and applications. We also give an overview
of major approaches and classify them with respect to their strategies including
big data management platform, big data service models, distributed file system,
data storage, data virtualization platform, distributed applications and MapReduce
optimization. Finally, we discuss the open issues and challenges in processing big
data in three important aspects: big data storage, analysis and security.
ehttp://www.gartner.comfhttp://en.wikipedia.org/wiki/Big-data
1250009-3
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
The rest of the paper is organized as follows. Section 2 reviews the architecture
and the key concepts of big data processing. Sections 3 and 4 present the classifica-
tion of major distributed applications and optimization methods of the MapReduce
framework while Section 5 discusses several open issues and future challenges. Fi-
nally, Section 6 concludes this paper.
2. Big Data Management System
According to a recent survey by Gartner in 2010g, 47% of survey respondents
rank data growth in their top three challenges, followed by system performance
and scalability at 37%, and network congestion and connectivity architecture at
36%. Many researchers have suggested that commercial Data Base Management
Systems (DBMSs) are not suitable for processing extremely big data. Classic archi-
tecture’s potential bottleneck is the database server while facing peak workloads.
One database server has restriction of scalability and cost, which are two impor-
tant goals of big data processing. In order to adapt various large data processing
models, D. Kossmann et al. presented four different architectures based on classic
multi-tier database application architecture which includes partitioning, replication,
distributed control and caching architecture.4 It is clear that alternative providers
have different business models and target different kinds of applications: Google
seems to be more interested in small applications with light workload whereas Azure
is currently the most affordable service for medium to large services. Most of recent
cloud service providers are utilizing hybrid architecture that is capable of satisfy-
ing their actual service requirements. In this section, we mainly discuss big data
architecture from four key aspects: big data service models, distributed file system,
non-structural and semi-structured data storage and data virtualization platform.
2.1. Big data service model
As we all known, cloud computing is a kind of information and communication
technology, which delivers valuable resources to people as a service, such as Software
as a Service (SaaS), Infrastructure as a Service (IaaS) and Platform as a Service
(PaaS). There are several leading Information Technology (IT) solution providers
that offer these services to the customers. Now, as the concept of the big data came
up, cloud computing service model is gradually transferring into big data service
model, which are DaaS (Database as a Service), AaaS (Analysis as a Service) and
BDaaS (Big data as a Service). The detailed descriptions are as follows:
Database as a Service means that database services are available applications
deployed in any execution environment, including on a PaaS. But in the big data
context, these would optimally be scale-out architectures such as NoSQL data stress
and in-memory databases.
ghttp://publictechnology.net/sector/central-gov/gartner-data-growth-remains-challenge-data-centre-managers
1250009-4
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
Analysis as a Service would be more familiar with interacting with an ana-
lytics platform on a higher abstraction level. They would typically execute scripts
and queries that data scientists or programmers developed for them.
Big data as a Service coupled with Big Data platforms are for users that need
to customize or create new big data stacks, however, readily available solutions do
not yet exist. Users must first acquire the necessary cloud computing infrastruc-
ture, and manually install the big data processing software. For complex distributed
services, this can be a daunting challenge.5
2.2. Distributed File System
Google File System (GFS)6 is a chunk-based distributed file system that supports
fault-tolerance by data partitioning and replication. As an underlying storage layer
of Google’s cloud computing platform, it is used to read input and store output of
MapReduce.7 Similarly, Hadoop also has a distributed file system as its data stor-
age layer called Hadoop Distributed File System (HDFS),8 which is an open-source
counterpart of GFS. GFS and HDFS are user level file systems that do not imple-
ment POSIX semantics and heavily optimized for the case of large files (measured
in gigabytes).9 Amazon Simple Storage Service (S3)10 is an online public storage
web service offered by Amazon Web Services. This file system is targeted at clusters
hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. S3
aims to provide scalability, high availability, and low latency at commodity costs.
ES211 is an elastic storage system of epiC, which is designed to support both func-
tionalities within the same storage. The system provides efficient data loading from
different sources, flexible data partitioning scheme, index and parallel sequential
scan. In addition, there are several general file systems that have not to be ad-
dressed such as Moose File System (MFS), Kosmos Distributed File system (KFS).
2.3. Non-structural and Semi-structured Data Storage
With the success of the Web 2.0, most IT companies increasingly need to store and
analyze the ever growing data, such as search logs, crawled web content and click
streams collected from a variety of web services, which are usually in the range of
petabytes. However, web data sets are usually non-relational or less structured and
processing such semi-structured data sets at scale poses another challenge. Moreover,
simple distributed file systems mentioned above cannot satisfy service providers like
Google, Yahoo!, Microsoft and Amazon. All providers have their purpose to serve
potential users and own their relevant state-of-the-art of big data management sys-
tems in the cloud environment. Bigtable12 is a distributed storage system of Google
for managing structured data that is designed to scale to a very large size (petabytes
of data) across thousands of commodity servers. Bigtable does not support a full
relational data model. However, it provides clients with a simple data model that
supports dynamic control over data layout and format. PNUTS13 is a massive scale
hosted database system designed to support Yahoo! web applications. The main
1250009-5
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
focus of the system is on data serving for web applications, rather than complex
queries. Upon PNUTS, new applications can be easily built and the overhead of
creating and maintaining these applications is nothing much. The Dynamo14 is a
highly available and scalable distributed key/value-based data store which is built for
supporting internal Amazon’s applications. It provides a simple primary-key only
interface to meet the requirements of these applications. However, it differs from
key-value storage system. Facebook proposed the design of a new cluster-based data
warehouse system, Llama,15 a hybrid data management system which combines the
features of row-wise and column-wise database systems. They also described a new
column-wise file format for Hadoop called CFile, which provides better performance
than other file formats in data analysis.
2.4. Data Virtualization Platform
Data virtualization describes the process of abstracting disparate systems. It can
be described as conceptual building of abstract layers of resources. In short, big
data and cloud computing refer to a convergence of technologies and trends that are
making IT infrastructures and applications more dynamic, more modular and more
consumable. Currently, the technology of constructing virtualization platform is just
in the primary phase, which mainly depends on the cloud data center integration
technology.
The main idea behind datacenter is to leverage the virtualization technology
to maximize the utilization of computing resources. Therefore, it provides the ba-
sic ingredients such as storage, CPUs and network bandwidth as a commodity by
specialized service providers at low unit cost. For reaching the goals of big data
management, most of the research institutions and enterprises bring virtualization
into cloud architectures. Amazon Web Services (AWS), Eucalyptus, OpenNebula,
Cloud Stack and OpenStack are the most popular cloud management platforms for
infrastructure as a service (IaaS). AWS is not free but it has huge usage in elastic
platform. It is very easy to use and only pay-as-you-go. The Eucalyptus16 works
as an open source in IaaS. It uses virtual machine in controlling and managing re-
sources. Since Eucalyptus is the earliest cloud management platform for IaaS, it
signs API compatible agreement with AWS. It has a leading position in the private
cloud market for the AWS ecological environment. OpenNebula17 has integration
with various environments. It can offer the richest features, flexible ways and bet-
ter interoperability to build private, public or hybrid clouds. OpenNebula is not a
Service Oriented Architecture (SOA) design and has weak decoupling in computing,
storage and network independent components. CloudStack is an open source cloud
operating system which delivers public cloud computing similar to Amazon EC2
but using users’ own hardware. CloudStack users can take full advantage of cloud
computing to deliver higher efficiency, limitless scale and faster deployment of new
services and systems to the end user. At present, CloudStack is one of the Apache
open source projects. It already has mature functions. However, it needs to further
1250009-6
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
strengthen the loosely coupling and component design. OpenStack is a collection of
open source software projects aiming to build an open-source community with re-
searchers, developers and enterprises. People in this community share a common goal
to create a cloud that is simple to deploy, massively scalable and full of rich features.
The architecture and components of OpenStack are straightforward and stable, so it
is a good choice to provide specific applications for enterprises. In current situation,
OpenStack has a good community and ecological environment. However, it still has
some shortcomings like incomplete functions and lack of commercial supports.
3. Distributed Applications
In this age of data explosion, parallel processing is essential to perform a massive
volume of data in a timely manner. In contrast, the use of distributed techniques
and algorithms is the key to achieve better scalability and performance in pro-
cessing big data. At present, there are a lot of popular parallel and distributed
processing models, including MPI, General Purpose GPU (GPGPU), MapReduce
and MapReduce-like. We will focus on the last two processing models.
3.1. MapReduce
MapReduce proposed by Google, is a very popular big data processing model that
has rapidly been studied and applied by both industry and academia.7 MapReduce
has two major advantages: it hide details related to the data storage, distribution,
replication, load balancing and so on. Furthermore, it is so simple that programmers
only specify two functions, which are map function and reduce function. We divided
existing MapReduce applications into three categories: partitioning sub-space, de-
composing sub-processes and approximate overlapping calculations.
While MapReduce is referred to as a new approach of processing big data in cloud
computing environments, it is also criticized as a “major step backwards” compared
with DBMS.18 As the debate continues, the final result shows that neither of them
is good at what the other does well, and the two technologies are complementary.19
Recently, some DBMS vendors have integrated MapReduce front-ends into their sys-
tems including Aster, HadoopDB,20 Greenplum21 and Vertucah. Mostly of those are
still database, which simply provide a MapReduce front-end to a DBMS. HadoopDB
is a hybrid system which efficiently takes the best features from the scalability of
MapReduce and the performance of DBMS. Lately, J. Dittrich et al. proposed a new
type of system named Hadoop++22 which indicates that HadoopDB has also severe
drawbacks, including forcing user to use DBMS, changing the interface to SQL and
so on.
As the amount of data that needed to be processed grows, many data pro-
cessing methods have become not suitable or limited. So, recently, many research
hhttp://www.vertica.com/
1250009-7
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
efforts have exploited the MapReduce framework for solving challenging data pro-
cessing problems on large scale datasets in different domains, such as data mining,
information retrieval, image retrieval, machine learning and pattern recognition.
For example, Mahouti is an Apache project that aims at building scalable machine
learning libraries which are all implemented on Hadoop. The Ricardo23 is a soft sys-
tem that integrates R statistical tool and Hadoop to support parallel data analysis.
RankReduce24 perfectly combines the Local Sensitive Hashing (LSH) and MapRe-
duce, which effectively performs K-Nearest Neighbors search in the high dimensional
spaces. F. Cordeiro et al. proposed BoW25 method for clustering very large and
multi-dimensional datasets with MapReduce. MapDupReducer26 is a MapReduce
based system capable of detecting near duplicates over massive datasets efficiently.
In addition, C. Ranger et al.27 implemented the MapReduce framework on multi-
ple processors in a single machine, which gained good performance. G. Wang et al.
presented a shared-nothing, in-memory MapReduce framework called BRACE,28
which includes a high-level language for programming simulations to process simu-
lations efficiently. Recently, B. He et al. developed Mars,29 a GPU-based MapReduce
framework, which gained better performance than the state-of-the-art CPU-based
framework.
3.2. MapReduce-like
Many programmers feel uncomfortable with the MapReduce framework and prefer
to use SQL as a high-level declarative language. Several projects have been developed
to ease the task of programmers and provide high-level declarative interfaces on top
of the MapReduce framework. The declarative query languages allow query indepen-
dence from program logics, reuse of the queries and automatic query optimization
features like SQL does for DBMS. We call them the MapReduce-like system. The
Apache Pig30 project is designed as an engine for executing data flows in parallel on
Hadoop. It uses a language, called Pig Latin to express these data flows. It is built
on top of Hadoop framework, and its usage requires no modification to Hadoop.
The Apache Hive31 project is an open-source data warehousing solution built by
the Facebook Data Infrastructure Team. It supports ad-hoc queries with an SQL-
like query language called HiveQL. DryadLINQ32 is developed to translate LINQ
expressions of a program into a distributed execution plan for Dryad, Microsoft’s
parallel data processing tool.
In recent two years, it has emerged some new distributed data processing systems,
and even called beyond MapReduce. However, in essence these are all MapReduce’s
further improvements and outspreads. For instance, Google incremental calcula-
tion framework Percolator,33 real-time query system Dremel,34 distributed database
Google Spanner,35 Berkeley’s interactive real-time processing system Spark,36 data
flow computing system Yahoo!’s S4,37 Facebook’s Puma, Twitter’s Strom and so on.
ihttp://mahout.apache.org/
1250009-8
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
3.3. Application Challenge
As we all known, deploying big data applications on cloud environment is not a
trivial or straightforward task. We need to exploit the cloud computing methods to
process more areas of big data. There are several important classes of existing data
processing and applications that seem to be more compelling with cloud environ-
ments and contribute further to its momentum in the near future, such as:
Complex Multi-media Data: In the new cloud based multimedia-computing
paradigm, users store and process their multimedia application data in a distributed
manner, eliminating full installation of the media application software. Multimedia
processing in the context of cloud environments imposes great heterogeneity chal-
lenges in content-based multimedia retrieval system,38 distributed complicated data
processing, high cloud QoS support, media cloud transport protocol, media cloud
overlay network and media cloud security, P2P cloud for multimedia services, and
so on.
Physical and Virtual Worlds Data: The power of people interacting with
people in an online setting has driven the success or failure of many companies in
the internet space. There are also many difficulties such as how to organize big data
storage, and whether process it on real world or virtual world. We need to present a
new architecture and implementation of a virtual cloud to fuse of cloud computing
and virtual worlds. The large-scale of virtualized resources also need to be processed
effectively and efficiently.
Mobile Cloud Data Analytics: Smart phones and tablet remarkably started
to carry sensors like GPS, Camera and Bluetooth etc. People and devices are all
loosely connected and trillions of such connected components will generate a huge
data ocean. They are generally relying on large datasets which is difficult to be stored
on small devices with limited computing resources. Hence, these large datasets are
more conveniently to be hosted in large datacenters and accessed through the cloud
on their demand. Besides, dynamic indexing, analyzing and querying large volumes
of high-dimensional spatial big data are major challenges.
4. MapReduce Optimization
Previous works have shown that MapReduce systems are inefficient in utilizing com-
puting resources.39 In this section, we present details of approaches about improving
the performance of processing big data with MapReduce.
4.1. Data Transfer Bottlenecks
It is a big challenge that cloud users must consider how to minimize the cost of
data transmission. Consequently, researchers have begun to propose variety of ap-
proaches. Map-Reduce-Merge40 is a new model that adds a Merge phase after Re-
duce phase that combines two reduce outputs from two different MapReduce jobs
into one, which can efficiently merge data that is already partitioned and sorted (or
1250009-9
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
hashed) by Map and Reduce modules. Map-Join-Reduce41 is a system that extends
and improves MapReduce runtime framework by adding Join stage before Reduce
stage to perform complex data analysis tasks on large clusters. The authors pre-
sented a new data processing strategy which runs filtering-join aggregation tasks
with two consecutive MapReduce jobs. It adopts one-to-many shuffling scheme to
avoid frequent check pointing and shuffling of intermediate results. Moreover, dif-
ferent jobs often perform similar work, thus sharing similar work reduces overall
amount of data transfer between jobs. MRShare42 is a sharing framework proposed
by T. Nykiel et al. that transforms a batch of queries into a new batch that can be
executed more efficiently by Merging jobs into groups and evaluating each group as
a single query.
4.2. Index Optimization
Many researchers have implemented the traditional and optimized index structures
on MapReduce to obtain better performance. In Ref. 43, T. Liu et al. built hybrid
spill trees in parallel and implemented a scalable image searching algorithm which
can be used efficiently to find near duplicates among over billions of images using
MapReduce. However, the tree-based approaches have some problems. They did not
scale due to traditional top-down search that overloaded the nodes near the tree
root, and failed to provide full decentralization. Whereas Voronoi-based index44
made clusters highly scalable by its loose coupling and shared nothing architecture.
Till now, Voronoi-based index cannot process multidimensional data. Hence, the
index structure which is simple, scalable and well be used for distributed processing
mode is a best choice for the effective store and processing of the data. Later, Menon
et al., presented a novel parallel algorithm for constructing suffix array and BWT
of a sequence leveraging the unique features of MapReduce and reduced the end to
end runtime from hours to mere minutes.45 There are also some papers adapting
inverted index, which is a simple but practical index structure and appropriate for
MapReduce to process big data, such as in Ref. 46 etc. We did a large of research on
large-scale spatial data environment and designed a distributed inverted grid index
by combining inverted index and spatial grid partition with MapReduce model,
which is simple, dynamic, scalable and fits for processing high dimensional spatial
data.47 While most kinds of large data are high dimensional, so in Ref. 48, J. Wang
et al. designed a new system, epiC, in which different types of indexes were built to
provide efficient query processing for different applications.
4.3. Iterative Optimization
Classic parallel applications are developed using message passing runtimes such as
MPI (Message Passing Interface) and PVM (Parallel Virtual Machine), where par-
allel algorithms are developed using above techniques to utilize the rich set of com-
1250009-10
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
munication and synchronization constructs offered which are to create diverse com-
munication topologies. In contrast, MapReduce and similar high-level programming
models support simple communication topologies and synchronization constructs.
MapReduce also is a popular platform in which the dataflow takes the form of a di-
rected acyclic graph of operators. However, it requires lots of I/Os and unnecessary
computations while solving the problem of iterations with MapReduce. Twister49
proposed by J. Ekanayake et al. is an enhanced MapReduce runtime that supports
iterative MapReduce computations efficiently, which adds an extra Combine stage
after Reduce stage. Thus, data output from combine stage flows to the next it-
eration’s Map stage. It avoids instantiating workers repeatedly during iterations
and previously instantiated workers are reused for the next iteration with differ-
ent inputs. In their research, they also expanded the applicability of MapReduce
to more kinds of applications such as Map-only, MapReduce, iterative MapReduce
and more Extensions. HaLoop50 is similar to Twister, which is a modified version
of the MapReduce framework that supports for iterative applications by adding a
Loop control. It also allows to cache both stages’ input and output to save more
I/Os during iterations. There exist lots of iterations during graph data process-
ing. Pregel51 implements a programming model motivated by the Bulk Synchronous
Parallel(BSP) model, in which each node has its own input and transfers only some
messages which are required for the next iteration to other nodes.
4.4. Online
There are some jobs which need to process online while original MapReduce can
not do this very well. MapReduce Online52 is designed to support online aggrega-
tion and continuous queries in MapReduce. It raises an issue that frequent check
pointing and shuffling of intermediate results limit pipelined processing. The authors
modified MapReduce framework by making Mappers push their data temporarily
stored in local storage to Reducers periodically in the same MR job. In addition,
Map-side pre-aggregation is used to reduce communication. Hadoop Online Proto-
type (HOP)53 proposed by T. Condie is similar to MapReduce Online. HOP is a
modified version of MapReduce framework that allows users to early get returns
from a job as it is being computed. It also supports for continuous queries which
enable MapReduce programs to be written for applications such as event monitor-
ing and stream processing while retaining the fault tolerance properties of Hadoop.
D. Jiang et al.54 found that the merge sort in MapReduce costs lots of I/Os and
seriously affects the performance of MapReduce. In the study, the results are hashed
and pushed to hash tables held by reducers as soon as each map task outputs its in-
termediate results. Then, reducers perform aggregation on the values in each bucket.
Since each bucket in the hash table holds all values which correspond to a distinct
key, no grouping is required. In addition, reducers can perform aggregation on the
fly even when all mappers are not completed yet.
1250009-11
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
4.5. Join Query Optimization
Join Query is a popular problem in big data area. However a join problem needs
more than two inputs while MapReduce is devised for processing a single input.
R. Vernica et al.55 proposed a 3-stage approach for end-to-end set-similarity joins.
They efficiently partitioned the data across multiple nodes in order to balance the
workload and minimize the need for replication. W. Lu et al. investigated how to
perform kNN join using MapReduce.56 Mappers cluster objects into groups, then
Reducers perform the kNN join on each group of objects separately. To reduce shuf-
fling and computational costs, they designed an effective mapping mechanism that
exploits pruning rules for distance filtering. In addition, two approximate algorithms
minimize the number of replicas to reduce the shuffling cost.
4.6. Data Skew
MapReduce is not very efficient when handing skewed data, since it only considers
the key and adopts a uniform hash method to distribute workload to each reducer.
This can lead to load imbalance, increase the processing time, generate the “strag-
gler” and the final result is the performance degradation. The problem is firstly
described in Ref. 18. Google proposed the backup task which is used to alleviate the
problem of stragglers. At present, some valid approaches to data skew have been
proposed. S. Ibrahim et al. made a good partition scheme based on the key’s fre-
quency.57 They assigned keys to reduce based on cost models,58 divided the workload
into fine-grained partitions and re-assigning these partitions to other nodes which
have idle resources. Y. Xu et al. considered that the key issue is how to make every
reducer balance to each other and proposed method59 that divides a MapReduce job
into two phases: sampling MapReduce job and expected MapReduce job. They also
keep track of the distribution on intermediate keys’ frequencies and make a good
partition scheme. Therefore, the key to handle data skew is to estimate the distribu-
tion of the data and make a good partition scheme. However, it is a huge challenge
to quickly and efficiently get the data distribution information, while keeping the
high performance of the processing platform.
4.7. Scheduling Optimization
MapReduce job scheduling is one of the hot and popular topics in the context of
cloud computing. To reduce operational costs, organizations often deploy MapRe-
duce in a shared cluster. For example, the data warehouse Hadoop cluster at Face-
book contains over 2,000 machines and receives 25,000 MapReduce jobs per day on
average. Due to the large number of jobs, sharing limited resources and efficient
job scheduler is critical for improving system performance and resource utilization.
First In First Out (FIFO) Scheduler, Fair Scheduler and Capacity Scheduler are the
most widely used schedulers in practice. However, these scheduling cause job star-
vation. Lately, there are also many variations about scheduling methods. In Ref. 60,
1250009-12
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
J. Tan et al. designed and implemented Coupling Scheduler, which launching reduce
tasks depending on map task progresses. This scheduling method gains many im-
provements to job response times by up to an order of magnitude. The authors in
Ref. 61 proposed a different and flexible scheduling allocation scheme called FLEX,
which optimizes any of a variety of standard scheduling theory metrics. Sandholm
et al.62 try to improve MapReduce job scheduling using regulated dynamic prior-
itization strategy. Scheduling data flows on the cloud is also a very complex and
challenging problem, H. Kllapi et al.63 presented a framework for dataflow schedule
optimization, which gains quite promising effectiveness.
5. Discussion and Challenges
As previously mentioned, we are now in the days of big data. The top seven big
data drivers are science data, Internet data, finance data, mobile device data, sensor
data, RFID data and streaming data. Coupled with recent advances in machine
learning and reasoning, as well as rapid rises in computing power and storage, we
are transforming our ability to make sense of these increasingly large, heterogeneous,
noisy and incomplete datasets collected from a variety of sources.
So far, researchers are not able to unify around the essential features of big data.
Some of them think that big data is the data that we are not able to process using
pre-exist technology, method and theory. However, no matter how we consider the
definition of big data, the world is turning into a “helplessness” age while varies of
incalculable data is being generated by science, business and society. Big data put
forward new challenges for data management and analysis, and even for the whole
IT industry.
We consider there are three important aspects that we would encounter with
when processing big data, and we present our points of view in details as follows:
Big Data Storage and Management: Current technologies of data manage-
ment systems are not able to satisfy the needs of big data, and the increasing speed of
storage capacity is much less than that of data. Thus, a revolution of re-constructing
information framework is desperately needed. We need to design hierarchical storage
architecture. Besides, previous computer algorithms are not able to effectively stor-
age data that is directly acquired from the actual world, due to the heterogeneity
of the big data. However, they perform excellent in processing homogeneous data.
Therefore, how to re-organize data is one big problem in big data management.
Virtual server technology can exacerbate the problem, raising the prospect of over-
committed resources, especially if communication is poor between the application,
server and storage administrators. We also need to solve bottleneck problems of high
concurrent I/O and single-named node in the present Master-Slave system model.
Big Data Computation and Analysis: While processing a query in big data,
speed is a significant demand.64 However, this may cost a lot of time because it
cannot traverse all the related data in the whole database in a short time. In this
case, index will be an optimal choice. At present, indices in big data are only aiming
1250009-13
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
at simple type of data, while big data is becoming more complicated. The combi-
nation of appropriate index for big data and up-to-date preprocessing technology
will be a desirable solution when we encountered this kind of problems. Application
parallelization and divide-and-conquer are natural computational paradigms for ap-
proaching big data problems. But getting additional computational resources is not
as simple as just upgrading to a bigger and more powerful machine on the fly. The
traditional serial algorithm is inefficient for the big data. If there is enough data
parallelism in the application, users can take advantage of the cloud’s reduced cost
model to use hundreds of computers for a short time costs.
Big Data Security: By using online big data application, a lot of companies
can greatly reduce their IT costs. However, security and privacy affect the entire big
data storage and processing, since there are a massive use of third-party services
and infrastructures that are used to host important data or to perform critical
operations. The scale of data and applications grow exponentially, and bring huge
challenges of dynamic data monitoring and security protection. Unlike traditional
security method, security in big data is mainly in the form of how to process data
mining without exposing sensitive information of users. Besides, current technologies
of privacy protection are mainly based on static data set, while data is always
dynamically changed, including data pattern, variation of attribute and addition of
new data. Thus, it is a challenge to implement effective privacy protection in this
complex circumstance. In addition, legal and regulatory issues also need attention.
6. Conclusion
This paper described a systematic flow of survey on the big data processing in the
context of cloud computing. We respectively discussed the key issues, including cloud
storage and computing architecture, popular parallel processing framework, major
applications and optimization of MapReduce/MapReduce-like. Big Data is not a new
concept but very challenging. It calls for scalable storage index and a distributed
approach to retrieve required results in near real-time. It is a fundamental fact that
data is too big to process conventionally. Nevertheless, big data will be complex and
exist continuously during all big challenges, which are the big opportunities for us. In
the future, significant challenges need to be tackled by industry and academia. There
is an urgent need that computer scholars and social sciences scholars make close
cooperation, guaranteeing the long-term success of cloud computing and collectively
explore new territory.
Acknowledgments
This work is supported by the National Science Foundation for Distinguished Young
Scholars of China under grant nos. 61225010, NSFC under grant nos. of 61173160,
61173161, 61173162, 61173165, 61103234 and 61272417, Program for New Century
Excellent Talents in University (NCET-10-0095) of Ministry of Education of China.
Project of College Students’ Innovative and Entrepreneurial Training Program of
1250009-14
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
China under grant nos. of 2011022, 2011003, 201211258013 and 201211258014.
The Fundamental Research Funds for the Central Universities under grant nos.
of 2012TD008 and 2012QN029.
References
1. (2008) Big data: science in the petabyte era. Nature 455 (7209): 1 .
2. Douglas and Laney (2008) The importance of ‘big data’: A definition.
3. Ji, C., Li, Y., Qiu, W., Awada, U., and Li, K. (2012) Big data processing in cloud
computing environments. Pervasive Systems, Algorithms and Networks (ISPAN), 2012
12th International Symposium on, pp. 17–23, IEEE.
4. Kossmann, D., Kraska, T., and Loesing, S. (2010) An evaluation of alternative archi-
tectures for transaction processing in the cloud. Proceedings of the 2010 international
conference on Management of data, pp. 579–590, ACM.
5. Horey, J., Begoli, E., Gunasekaran, R., Lim, S., and Nutaro, J. (2012) Big data platforms
as a service: Challenges and approach. Proceedings of the 4th USENIX conference on
Hot Topics in Cloud Ccomputing, pp. 16–16, USENIX Association.
6. Ghemawat, S., Gobioff, H., and Leung, S. (2003) The google file system. ACM SIGOPS
Operating Systems Review , vol. 37, pp. 29–43, ACM.
7. Dean, J. and Ghemawat, S. (2008) Mapreduce: simplified data processing on large
clusters. Communications of the ACM , 51, 107–113.
8. Borthakur, D. (2007) The hadoop distributed file system: Architecture and design.
Hadoop Project Website, 11.
9. Rabkin, A. and Katz, R. (2010) Chukwa: A system for reliable large-scale log collection.
USENIX Conference on Large Installation System Administration, pp. 1–15.
10. Sakr, S., Liu, A., Batista, D., and Alomari, M. (2011) A survey of large scale data
management approaches in cloud environments. Communications Surveys & Tutorials,
IEEE , 13, 311–336.
11. Cao, Y., Chen, C., Guo, F., Jiang, D., Lin, Y., Ooi, B., Vo, H., Wu, S., and Xu, Q. (2011)
Es2: A cloud data storage system for supporting both oltp and olap. Data Engineering
(ICDE), 2011 IEEE 27th International Conference on, pp. 291–302, IEEE.
12. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra,
T., Fikes, A., and Gruber, R. (2006) Bigtable: A distributed structured data storage
system. 7th OSDI , pp. 305–314.
13. Cooper, B., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen,
H., Puz, N., Weaver, D., and Yerneni, R. (2008) Pnuts: Yahoo!’s hosted data serving
platform. Proceedings of the VLDB Endowment , 1, 1277–1288.
14. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A.,
Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007) Dynamo: amazon’s highly
available key-value store. ACM SIGOPS Operating Systems Review , vol. 41, pp. 205–
220, ACM.
15. Lin, Y., Agrawal, D., Chen, C., Ooi, B., and Wu, S. (2011) Llama: leveraging columnar
storage for scalable join processing in the mapreduce framework. Proceedings of the 2011
international conference on Management of data, pp. 961–972, ACM.
16. Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L., and
Zagorodnov, D. (2009) The eucalyptus open-source cloud-computing system. Cluster
1250009-15
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
Computing and the Grid, 2009. CCGRID’09. 9th IEEE/ACM International Symposium
on, pp. 124–131, IEEE.
17. Sempolinski, P. and Thain, D. (2010) A comparison and critique of eucalyptus, open-
nebula and nimbus. Cloud Computing Technology and Science (CloudCom), 2010 IEEE
Second International Conference on, pp. 417–426, Ieee.
18. DeWitt, D. and Stonebraker, M. (2008) Mapreduce: A major step backwards. The
Database Column, 1.
19. Stonebraker, M., Abadi, D., DeWitt, D., Madden, S., Paulson, E., Pavlo, A., and Rasin,
A. (2010) Mapreduce and parallel dbmss: friends or foes. Communications of the ACM ,
53, 64–71.
20. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., and Rasin, A. (2009)
Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical
workloads. Proceedings of the VLDB Endowment , 2, 922–933.
21. Xu, Y., Kostamaa, P., and Gao, L. (2010) Integrating hadoop and parallel dbms. Pro-
ceedings of the 2010 international conference on Management of data, pp. 969–974,
ACM.
22. Dittrich, J., Quiane-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., and Schad, J. (2010)
Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing).
Proceedings of the VLDB Endowment , 3, 515–529.
23. Das, S., Sismanis, Y., Beyer, K., Gemulla, R., Haas, P., and McPherson, J. (2010)
Ricardo: integrating r and hadoop. Proceedings of the 2010 international conference on
Management of data, pp. 987–998, ACM.
24. Stupar, A., Michel, S., and Schenkel, R. (2010) Rankreduce–processing k-nearest neigh-
bor queries on top of mapreduce. Proceedings of the 8th Workshop on Large-Scale Dis-
tributed Systems for Information Retrieval , pp. 13–18.
25. Ferreira Cordeiro, R., Traina Junior, C., Machado Traina, A., Lopez, J., Kang, U., and
Faloutsos, C. (2011) Clustering very large multi-dimensional datasets with mapreduce.
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery
and data mining, pp. 690–698, ACM.
26. Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., Tian, W., Xu, J., and Li, R.
(2010) Mapdupreducer: detecting near duplicates over massive datasets. Proceedings of
the 2010 international conference on Management of data, pp. 1119–1122, ACM.
27. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., and Kozyrakis, C. (2007)
Evaluating mapreduce for multi-core and multiprocessor systems. High Performance
Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on,
pp. 13–24, IEEE.
28. Wang, G., Salles, M., Sowell, B., Wang, X., Cao, T., Demers, A., Gehrke, J., and White,
W. (2010) Behavioral simulations in mapreduce. Proceedings of the VLDB Endowment ,
3, 952–963.
29. He, B., Fang, W., Luo, Q., Govindaraju, N., and Wang, T. (2008) Mars: a mapreduce
framework on graphics processors. Proceedings of the 17th international conference on
Parallel architectures and compilation techniques , pp. 260–269, ACM.
30. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. (2008) Pig latin: a
not-so-foreign language for data processing. Proceedings of the 2008 ACM SIGMOD
international conference on Management of data, pp. 1099–1110, ACM.
1250009-16
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
31. Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff,
P., and Murthy, R. (2009) Hive: a warehousing solution over a map-reduce framework.
Proceedings of the VLDB Endowment , 2, 1626–1629.
32. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P., and Currey, J.
(2008) Dryadlinq: A system for general-purpose distributed data-parallel computing
using a high-level language. Proceedings of the 8th USENIX conference on Operating
systems design and implementation, pp. 1–14.
33. Peng, D. and Dabek, F. (2010) Large-scale incremental processing using distributed
transactions and notifications. Proceedings of the 9th USENIX conference on Operating
systems design and implementation, pp. 1–15, USENIX Association.
34. Melnik, S., Gubarev, A., Long, J., Romer, G., Shivakumar, S., Tolton, M., and Vassi-
lakis, T. (2010) Dremel: interactive analysis of web-scale datasets. Proceedings of the
VLDB Endowment , 3, 330–339.
35. Corbett, J., et al. (2012) Spanner: Googles globally-distributed database. To appear in
Proceedings of OSDI , p. 1.
36. Zaharia, M., Chowdhury, M., Franklin, M., Shenker, S., and Stoica, I. (2010) Spark:
cluster computing with working sets. Proceedings of the 2nd USENIX conference on
Hot topics in cloud computing, pp. 10–10, USENIX Association.
37. Neumeyer, L., Robbins, B., Nair, A., and Kesari, A. (2010) S4: Distributed stream com-
puting platform. Data Mining Workshops (ICDMW), 2010 IEEE International Con-
ference on, pp. 170–177, IEEE.
38. Qi, H., Li, K., Shen, Y., and Qu, W. (2012) Object-based image retrieval with kernel
on adjacency matrix and local combined features. ACM Transactions on Multimedia
Computing, Communications, and Applications (TOMCCAP), 8, 54.
39. Lee, K., Lee, Y., Choi, H., Chung, Y., and Moon, B. (2012) Parallel data processing
with mapreduce: a survey. ACM SIGMOD Record , 40, 11–20.
40. Yang, H., Dasdan, A., Hsiao, R., and Parker, D. (2007) Map-reduce-merge: simplified
relational data processing on large clusters. Proceedings of the 2007 ACM SIGMOD
international conference on Management of data, pp. 1029–1040, ACM.
41. Jiang, D., Tung, A., and Chen, G. (2011) Map-Join-Reduce: Toward scalable and effi-
cient data analysis on large clusters. Knowledge and Data Engineering, IEEE Transac-
tions on, 23, 1299–1311.
42. Nykiel, T., Potamias, M., Mishra, C., Kollios, G., and Koudas, N. (2010) Mrshare:
Sharing across multiple queries in mapreduce. Proceedings of the VLDB Endowment ,
3, 494–505.
43. Liu, T., Rosenberg, C., and Rowley, H. (2007) Clustering billions of images with large
scale nearest neighbor search. Applications of Computer Vision, 2007. WACV’07. IEEE
Workshop on, pp. 28–28, IEEE.
44. Akdogan, A., Demiryurek, U., Banaei-Kashani, F., and Shahabi, C. (2010) Voronoi-
based geospatial query processing with mapreduce. Cloud Computing Technology and
Science (CloudCom), 2010 IEEE Second International Conference on, pp. 9–16, IEEE.
45. Menon, R., Bhat, G., and Schatz, M. (2011) Rapid parallel genome indexing with
mapreduce. Proceedings of the second international workshop on MapReduce and its
applications , pp. 51–58, ACM.
1250009-17
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
C. Ji et al.
46. Logothetis, D. and Yocum, K. (2008) Ad-hoc data processing in the cloud. Proceedings
of the VLDB Endowment , 1, 1472–1475.
47. Ji, C., Dong, T., Li, Y., Shen, Y., Li, K., Qiu, W., Qu, W., and Guo, M. (2012) Inverted
grid-based knn query processing with mapreduce. ChinaGrid, 2012 Seventh ChinaGrid
Annual Conference on, pp. 25–33, IEEE.
48. Wang, J., Wu, S., Gao, H., Li, J., and Ooi, B. (2010) Indexing multi-dimensional data
in a cloud system. Proceedings of the 2010 international conference on Management of
data, pp. 591–602, ACM.
49. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S., Qiu, J., and Fox, G. (2010)
Twister: a runtime for iterative mapreduce. Proceedings of the 19th ACM International
Symposium on High Performance Distributed Computing, pp. 810–818, ACM.
50. Bu, Y., Howe, B., Balazinska, M., and Ernst, M. (2010) Haloop: Efficient iterative data
processing on large clusters. Proceedings of the VLDB Endowment , 3, 285–296.
51. Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N., and Czajkowski,
G. (2010) Pregel: a system for large-scale graph processing. Proceedings of the 2010
international conference on Management of data, pp. 135–146, ACM.
52. Condie, T., Conway, N., Alvaro, P., Hellerstein, J., Elmeleegy, K., and Sears, R. (2010)
Mapreduce online. Proceedings of the 7th USENIX conference on Networked systems
design and implementation, pp. 21–21.
53. Condie, T., Conway, N., Alvaro, P., Hellerstein, J., Gerth, J., Talbot, J., Elmeleegy, K.,
and Sears, R. (2010) Online aggregation and continuous query support in mapreduce.
ACM SIGMOD , pp. 1115–1118.
54. Jiang, D., Ooi, B., Shi, L., and Wu, S. (2010) The performance of mapreduce: An
in-depth study. Proceedings of the VLDB Endowment , 3, 472–483.
55. Vernica, R., Carey, M., and Li, C. (2010) Efficient parallel set-similarity joins using
mapreduce. SIGMOD conference, pp. 495–506, Citeseer.
56. Zhang, C., Li, F., and Jestes, J. (2012) Efficient parallel knn joins for large data in
mapreduce. Proceedings of the 15th International Conference on Extending Database
Technology, pp. 38–49, ACM.
57. Ibrahim, S., Jin, H., Lu, L., Wu, S., He, B., and Qi, L. (2010) Leen: Locality/fairness-
aware key partitioning for mapreduce in the cloud. Cloud Computing Technology and
Science (CloudCom), 2010 IEEE Second International Conference on, pp. 17–24, IEEE.
58. Gufler, B., Augsten, N., Reiser, A., and Kemper, A. (2012) Load balancing in mapreduce
based on scalable cardinality estimates. Data Engineering (ICDE), 2012 IEEE 28th
International Conference on, pp. 522–533, IEEE.
59. Xu, Y., Zou, P., Qu, W., Li, Z., Li, K., and Cui, X. (2012) Sampling-based partitioning
in mapreduce for skewed data. ChinaGrid, 2012 Seventh ChinaGrid Annual Conference
on, pp. 1–8, IEEE.
60. Tan, J., Meng, X., and Zhang, L. (2012) Delay tails in mapreduce scheduling. Proceed-
ings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference
on Measurement and Modeling of Computer Systems , pp. 5–16, ACM.
61. Wolf, J., Rajan, D., Hildrum, K., Khandekar, R., Kumar, V., Parekh, S., Wu, K., and
Balmin, A. (2010) Flex: A slot allocation scheduling optimizer for mapreduce workloads.
Middleware 2010 , pp. 1–20.
1250009-18
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.
April 10, 2013 11:17
Big Data Processing: Big Challenges and Opportunities
62. Sandholm, T. and Lai, K. (2009) Mapreduce optimization using regulated dynamic pri-
oritization. Proceedings of the eleventh international joint conference on Measurement
and modeling of computer systems , pp. 299–310, ACM.
63. Kllapi, H., Sitaridi, E., Tsangaris, M., and Ioannidis, Y. (2011) Schedule optimization
for data processing flows on the cloud. Proceedings of the ACM SIGMOD International
Conference on Management of Data, pp. 289–300.
64. Zhou, X., Lu, J., Li, C., and Du, X. (2012) Big data challenge in the management
perspective. Communications of the CCF , 8, 16–20.
1250009-19
J. I
nter
. Net
. 201
2.13
. Dow
nloa
ded
from
ww
w.w
orld
scie
ntif
ic.c
omby
UN
IVE
RSI
DA
DE
FE
DE
RA
L D
E S
AP
PAU
LO
on
10/2
1/13
. For
per
sona
l use
onl
y.