
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper, Volume 6, Issue 8, August 2016, ISSN: 2277 128X
Available online at: www.ijarcsse.com

Survey Paper on Big Data

C. Lakshmi*, V. V. Nagendra Kumar
MCA Department, RGMCET, Nandyal, Andhra Pradesh, India

Abstract— Big data is the term for any collection of datasets so large and complex that it becomes difficult to process them using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from datasets that are diverse, complex, and of a massive scale. A big data environment is used to acquire, organize, and analyze these various types of data. Data that is so large in volume, so diverse in variety, or moving with such velocity is called big data. Analyzing big data is a challenging task, as it involves large distributed file systems that must be fault tolerant, flexible, and scalable. The technologies used by big data applications to handle massive data include Hadoop, MapReduce, Apache Hive, NoSQL, and HPCC. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework that decomposes big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of materials and methods used in the research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data. Finally, we outline big data system architecture and present key challenges and research directions for big data systems.

Keywords— Big Data, Hadoop, MapReduce, Apache Hive, NoSQL, HPCC

I. INTRODUCTION

Big data is one of the biggest buzz phrases in the IT domain. New personal communication technologies are driving the big data trend, and the Internet population grows day by day but never quite reaches 100%. The need for big data comes from large companies such as Facebook, Yahoo, Google, and YouTube, which must analyze enormous amounts of data in unstructured or even structured form. Google alone holds a vast amount of information, hence the need for big data analytics: the processing of complex and massive datasets. This data differs from structured data in terms of five parameters: variety, volume, value, veracity, and velocity (the 5 V's). The five V's (volume, variety, velocity, value, veracity) are the challenges of big data management:

1. Volume: Data grows day by day in all types, across KB, MB, TB, PB, ZB, and YB of information, and accumulates into very large files. Excessive volume of data is the main storage issue, which is addressed chiefly by reducing storage cost. Data volumes are expected to grow 50 times by 2020.

2. Variety: Data sources are extremely heterogeneous. Files come in various formats and of any type; they may be structured or unstructured, such as text, audio, video, log files, and more. The varieties are endless, and the data enters the network without having been quantified or qualified in any way.

3. Velocity: The data arrives at high speed. Sometimes even one minute is too late, so big data is time sensitive. For some organisations, data velocity is the main challenge. Social media messages and credit card transactions are completed in milliseconds, and the data they generate is put into databases.

4. Value: This is the most important V of big data. Value is the main draw of big data because it is important for businesses and IT infrastructure systems to store large amounts of valuable data in databases.

5. Veracity: Veracity refers to the increase in the range of values typical of a large dataset. When dealing with high volume, velocity, and variety, not all of the data will be 100% correct; there will be dirty data. Big data and analytics technologies are built to work with these kinds of data.

Huge volumes of data (both structured and unstructured) must be managed through organization, administration, and governance. Unstructured data is data that is not held in a database. It may be textual, verbal, or in another form. Textual unstructured data includes PowerPoint presentations, email messages, Word documents, and instant messages. Data in other formats includes .jpg images, .png images, and audio files.


Fig. 1 Parameters of Big Data

Fig. 2 illustrates a general big data network model with MapReduce. Distinct applications in the cloud place demanding requirements on the acquisition, transport, and analytics of structured and unstructured data.

The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend toward larger datasets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, which allows correlations to be found to "spot business trends, prevent diseases, and combat crime and so on". Scientists regularly encounter limitations due to large datasets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance, and business informatics. Datasets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Fig. 2 General Framework of Big Data Networking


The history of big data can be roughly split into the following stages:

Megabyte to Gigabyte: In the 1970s and 1980s, historical business data introduced the earliest ``big data'' challenge in moving from megabyte to gigabyte sizes. The urgent need at that time was to house that data and run relational queries for business analyses and reporting. Research efforts gave birth to the ``database machine,'' which featured integrated hardware and software to solve the problem. The underlying philosophy was that such integration would provide better performance at lower cost. After a period of time, it became clear that hardware-specialized database machines could not keep pace with the progress of general-purpose computers. Thus, the descendant database systems are software systems that impose few constraints on hardware and can run on general-purpose computers.

Gigabyte to Terabyte: In the late 1980s, the popularization of digital technology caused data volumes to expand to several gigabytes or even a terabyte, which is beyond the storage and/or processing capabilities of a single large computer system. Data parallelization was proposed to extend storage capabilities and to improve performance by distributing data and related tasks, such as building indexes and evaluating queries, across disparate hardware. Based on this idea, several types of parallel databases were built, including shared-memory databases, shared-disk databases, and shared-nothing databases, as determined by the underlying hardware architecture.

Terabyte to Petabyte: During the late 1990s, when the database community was admiring its ``finished'' work on the parallel database, the rapid development of Web 1.0 led the whole world into the Internet era, along with massive semi-structured or unstructured web pages holding terabytes or petabytes (PBs) of data. The resulting need for search companies was to index and query the mushrooming content of the web. Unfortunately, although parallel databases handle structured data well, they provide little support for unstructured data. Additionally, system capabilities were limited to less than several terabytes.

Petabyte to Exabyte: Under current development trends, data stored and analyzed by big companies will undoubtedly reach PB to exabyte magnitude soon. However, current technology still handles only terabyte- to PB-scale data; no revolutionary technology has yet been developed to cope with larger datasets.

II. LITERATURE SURVEY

1. Hadoop MapReduce is a large-scale, open-source software framework dedicated to scalable, distributed, data-intensive computing. The framework breaks large data into smaller parallelizable chunks and handles scheduling; it maps each piece to an intermediate value, reduces the intermediate values to a solution, and supports user-specified partition and combiner options. It is fault tolerant and reliable, and it supports thousands of nodes and petabytes of data. If your algorithms can be rewritten as MapReduce jobs and your problem can be broken into small pieces solvable in parallel, then Hadoop MapReduce is the way to go for a distributed problem-solving approach to large datasets; it is tried and tested in production and offers many implementation options. The authors present the design and evaluation of a data-aware cache framework that requires minimal change to the original MapReduce programming model to provide incremental processing for big data applications using the MapReduce model [4].

2. The author [2] stated the importance of some of the technologies that handle big data, such as Hadoop, HDFS, and MapReduce. The author discussed the various schedulers used in Hadoop and the technical aspects of Hadoop, and also focused on the importance of YARN, which overcomes the limitations of MapReduce.

3. The authors [3] surveyed various technologies for handling big data and their architectures. That paper also discussed the challenges of big data (volume, variety, velocity, value, veracity) and the advantages and disadvantages of these technologies, and it described an architecture using Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers. The main goal of that paper was to survey big data handling techniques that process massive amounts of data from different sources and improve the overall performance of systems.

4. The author of [1] continues with the big data definition and enhances the definition given in [3] to include the 5V big data properties: volume, variety, velocity, value, and veracity. The author suggests other dimensions for big data analysis and taxonomy, in particular comparing and contrasting big data technologies in e-Science, industry, business, social media, and healthcare. With a long tradition of working with constantly increasing volumes of data, modern e-Science can offer industry its scientific analysis methods, while industry can bring advanced and rapidly developing big data technologies and tools to science and the wider public [1].

5. The author [6] stated that the need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. In the commercial sphere, business intelligence is driven by the ability to gather data from a dizzying array of sources. Big data analysis tools such as MapReduce over Hadoop and HDFS promise to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages [3].

6. The author [5] stated that there is a need to maximize returns on BI investments and to overcome difficulties. The problems and new trends mentioned in that article, and the solutions found through a combination of advanced tools, techniques, and methods, would help readers in BI projects and implementations. BI vendors are struggling and making continuous efforts to bring


technical capabilities and to provide complete out-of-the-box solutions with sets of tools and techniques. In 2014, due to the rapid change in BI maturity, BI teams faced a tough time building infrastructure with less-skilled resources. Consolidation and convergence are ongoing, and the market is coming up with a wide range of new technologies. Still, the ground is immature and in a state of rapid evolution.

7. The author [8] has given some important emerging framework model designs for big data analytics and a 3-tier architecture model for big data in data mining. The proposed 3-tier architecture model is more scalable when working in different environments and also helps overcome the main issues in big data analytics of storing, analyzing, and visualizing. The framework model is given for Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers.

8. A big data framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes over time and other possible factors. To support big data mining, high-performance computing platforms are required. With big data technologies [3], we will hopefully be able to provide the most relevant and most accurate social sensing feedback to better understand our society in real time [7].

9. There are many scheduling techniques available to improve job performance, but each technique has some limitations, so no single technique can overcome every parameter affecting the performance of the whole system, such as data locality, fairness, load balance, the straggler problem, and deadline constraints. Each technique has advantages over the others, so if some techniques are combined or interchanged, the result will be even better than any individual scheduling technique [10].

10. The author [9] describes the concept of big data along with the 3 Vs (volume, velocity, and variety) and focuses on big data processing problems. These technical challenges must be addressed for efficient and fast processing of big data. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains and are therefore not cost-effective to address in the context of one domain alone. The paper also describes Hadoop, open-source software used for processing big data.

11. The system proposed by the author [12] is based on an implementation of online aggregation of MapReduce in Hadoop for efficient big data processing. Traditional MapReduce implementations materialize the intermediate results of mappers and do not allow pipelining between the map and the reduce phases. This approach has the advantage of simple recovery in the case of failures; however, reducers cannot start executing tasks before all mappers have finished. As MapReduce Online is a modified version of Hadoop MapReduce, it supports online aggregation and stream processing, while also improving utilization and reducing response time.

12. The author [11] stated that, learning from the application studies, they explore the design space for supporting data-intensive and compute-intensive applications on large data-center-scale computer systems. Traditional data processing and storage approaches face many challenges in meeting the continuously increasing computing demands of big data. This work focused on MapReduce, one of the key enabling approaches for meeting big data demands by means of highly parallel processing on a large number of commodity nodes.

III. TECHNOLOGIES AND METHODS


Big data is a new concept for handling massive data; therefore, the architectural description of this technology is very new. The different technologies use almost the same approach, i.e., distributing the data among various local agents and reducing the load on the main server so that traffic can be avoided. There are endless articles, books, and periodicals that describe big data from a technology perspective, so we will instead focus our efforts here on setting out some basic principles and the minimum technology foundation needed to relate big data to the broader information management (IM) domain.

A. Hadoop

Hadoop is a framework that can run applications on systems with thousands of nodes and terabytes of data. It distributes files among the nodes and allows the system to continue working in case of a node failure, which reduces the risk of catastrophic system failure. An application is broken into smaller parts (fragments or blocks). Apache Hadoop consists of the Hadoop kernel, the Hadoop Distributed File System (HDFS), MapReduce, and related projects such as ZooKeeper, HBase, and Apache Hive. HDFS consists of three components: the NameNode, the Secondary NameNode, and the DataNode. The multilevel secure (MLS) environment problems of Hadoop can be addressed using the Security-Enhanced Linux (SELinux) protocol, in which Hadoop applications run at multiple security levels. This protocol is an extension of the Hadoop distributed file system. Hadoop is commonly used for distributed batch index building; it is desirable to optimize the indexing capability in near real time. Hadoop provides components for storage and analysis for large-scale processing, and nowadays it is used by hundreds of companies. The advantages of Hadoop are distributed storage and computational capabilities; it is extremely scalable, optimized for high throughput, uses large block sizes, and is tolerant of software and hardware failures.


Fig. 3 Architecture of Hadoop

Components of Hadoop:

HBase: An open-source, distributed, non-relational database system implemented in Java. It runs above the HDFS layer and can serve as the input and output for MapReduce in a well-structured manner.
Oozie: A web application that runs in a Java servlet container. Oozie uses a database to store information about workflows, which are collections of actions, and it manages Hadoop jobs in an orderly way.
Sqoop: A command-line interface application that provides a platform for transferring data between relational databases and Hadoop, in either direction.
Avro: A system that provides data serialization and data exchange services, used primarily with Apache Hadoop. These services can be used together or independently, according to the data records.
Chukwa: A framework for data collection and analysis, used to process and analyze massive amounts of logs. It is built on top of HDFS and the MapReduce framework.
Pig: A high-level platform, built on the MapReduce framework, that is used with Hadoop. It is a high-level data processing system in which data records are analyzed using a high-level language.
Zookeeper: A centralized service that provides distributed synchronization and group services, along with maintenance of configuration information and records.
Hive: A data warehouse application that provides an SQL-like interface and a relational model. The Hive infrastructure is built on top of Hadoop and helps in providing summarization and analysis for the respective queries.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers.

Hadoop is:
Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale in processors, memory, and locally attached storage.
Distributed: Handles replication and offers a massively parallel programming model, MapReduce.

Fig. 4 Hadoop System


Hadoop is an open-source implementation of a large-scale batch processing system. It uses the MapReduce framework introduced by Google, leveraging the concepts of map and reduce functions well known from functional programming. Although the Hadoop framework is written in Java, it allows developers to deploy custom programs written in Java or any other language to process data in parallel across hundreds or thousands of commodity servers. It is optimized for contiguous read requests (streaming reads), where processing consists of scanning all the data. Depending on the complexity of the process and the volume of data, response times can vary from minutes to hours. While Hadoop can process data fast, its key advantage is its massive scalability.

Hadoop is currently being used for indexing web searches, email spam detection, recommendation engines, prediction in financial services, genome manipulation in the life sciences, and analysis of unstructured data such as logs, text, and clickstreams. While many of these applications could in fact be implemented in a relational database (RDBMS), the core of the Hadoop framework is functionally different from an RDBMS. Hadoop is particularly useful when:
Complex information processing is needed.
Unstructured data needs to be turned into structured data.
Queries cannot be reasonably expressed using SQL.
Heavily recursive algorithms must be run.
Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
Machine learning is involved.
Datasets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB).
The data's value does not justify the expense of constant real-time availability, as with archives or special-interest information, which can be moved to Hadoop and remain available at lower cost.
Results are not needed in real time.
Fault tolerance is critical.
Significant custom coding would otherwise be required to handle job scheduling.

Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase, and ZooKeeper. The Hadoop framework is used by major players including Google, Yahoo, and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.

HDFS

The Hadoop Distributed File System (HDFS) is the file system component of the Hadoop framework. HDFS is designed and optimized to store data across large amounts of low-cost hardware in a distributed fashion.

Name Node:

The NameNode is the master node. It holds metadata about all the DataNodes: their addresses (used to communicate with them), their free space, the data they store, which DataNodes are active or passive, the task trackers, the job tracker, and other configuration information such as the replication of data.

Fig. 5 HDFS Architecture

The NameNode records all of the metadata, attributes, and locations of files and data blocks across the DataNodes. The attributes it records include file permissions, file modification and access times, and the namespace, which is a hierarchy of files and directories. The NameNode maps the namespace tree to file blocks in the DataNodes. When a client


node wants to read a file in HDFS, it first contacts the NameNode to receive the locations of the data blocks associated with that file.

The NameNode stores information about the overall system because it is the master of HDFS, with the DataNodes being the slaves. It stores the image and journal logs of the system and must always hold the most up-to-date image and journal. In essence, the NameNode always knows where the data blocks and their replicas are for each file, and it also knows where the free blocks are in the system, so it keeps track of where future files can be written.

Data Node:

The DataNode is a slave node in Hadoop that stores the data. Each DataNode also runs a task tracker, which tracks the jobs running on that DataNode and the jobs arriving from the NameNode.

The DataNodes store the blocks and block replicas of the file system. During startup, each DataNode connects to and performs a handshake with the NameNode. The DataNode checks for the correct namespace ID, and if it is not found, the DataNode automatically shuts down. New DataNodes can join the cluster by simply registering with the NameNode and receiving the namespace ID. Each DataNode maintains a block report for the blocks in its node and sends this block report to the NameNode every hour, so that the NameNode always has an up-to-date view of where block replicas are located in the cluster. During normal operation of HDFS, each DataNode also sends a heartbeat to the NameNode every ten minutes so that the NameNode knows which DataNodes are operating correctly and are available.
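To make the interaction with the NameNode and DataNodes concrete, the following minimal Java sketch uses the standard Hadoop FileSystem API to write and then read a small file in HDFS. It is an illustration only: the NameNode address (hdfs://namenode:9000) and the file path are assumptions of ours, not values taken from this paper, and in a real deployment the address would come from the cluster configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/sample.txt");

        // Write: the NameNode allocates blocks, the client streams bytes to DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, the client reads from DataNodes.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}

Note that the client contacts the NameNode only for metadata (block allocation and block locations); the block contents themselves flow between the client and the DataNodes.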

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common – contains libraries and utilities needed by other Hadoop modules.

Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Hadoop MapReduce – a programming model for large scale data processing.

Fig. 6 Hadoop Ecosystem

All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively. "Hadoop" often refers not just to the base Hadoop package but rather to the Hadoop ecosystem (Fig. 6), which includes all of the additional software packages that can be installed on top of or alongside Hadoop, such as Apache Hive, Apache Pig, and Apache Spark.

B. Map Reduce:

MapReduce was introduced by Google in order to process and store large datasets on commodity hardware. MapReduce is a model for processing large-scale data records in clusters. The MapReduce programming model is based on two functions: a map() function and a reduce() function. Users implement their own processing logic through well-defined map() and reduce() functions. In the map phase, the master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. A slave node may further divide its sub-problems, leading to a hierarchical tree structure. Each slave node processes its base problem and passes the result back to the master node. The MapReduce system groups all intermediate pairs by their intermediate keys and passes them to the reduce() function to produce the final output. In the reduce phase, the master node collects the results of all the sub-problems and combines them to form the output.

map(in_key, in_value) -> list(out_key, intermediate_value)
reduce(out_key, list(intermediate_value)) -> list(out_value)

The parameters of the map() and reduce() functions can also be written as map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v2).

A MapReduce framework is based on a master-slave architecture in which one master node manages a number of slave nodes. MapReduce works by first dividing the input dataset into even-sized data blocks for equal load distribution. Each data block is then assigned to one slave node and is processed by a map task, producing a result. A slave node notifies the master node when it becomes idle, and the scheduler then assigns it new tasks. The scheduler takes data locality and resources into consideration when it disseminates data blocks.
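To make the map() and reduce() signatures above concrete, the following minimal word-count job is written against the standard Hadoop MapReduce Java API. It is a sketch for illustration (the class names and the command-line input/output paths are our own choices, not taken from the paper): the mapper emits (word, 1) pairs as intermediate values, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(in_key, in_value) -> list(out_key, intermediate_value)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // reduce(out_key, list(intermediate_value)) -> list(out_value)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional user-specified combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}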


Fig. 7 Architecture of Map Reduce

Figure 7 shows the MapReduce architecture and how it works. The scheduler always tries to allocate a local data block to a slave node; if that attempt fails, it assigns a rack-local or random data block instead. When a map() task completes, the runtime system gathers all intermediate pairs and launches a set of reduce tasks to produce the final output. Large-scale data processing is a difficult task: managing hundreds or thousands of processors together with parallelization and distributed environments makes it even harder. MapReduce provides a solution to these issues, as it supports distributed and parallel I/O scheduling, is fault tolerant, supports scalability, and has built-in mechanisms for status reporting and monitoring of heterogeneous and large datasets, as in big data. It is a way of approaching and solving a given problem; using the MapReduce framework, the efficiency and the time to retrieve the data remain manageable. To address the volume aspect, new techniques have been proposed to enable parallel processing with the MapReduce framework, such as the data-aware caching (Dache) framework, which makes only a slight change to the original MapReduce programming model to enhance processing for big data applications.

The advantages of MapReduce are that a large variety of problems are easily expressible as MapReduce computations and that clusters of machines can handle thousands of nodes with fault tolerance. The disadvantages of MapReduce are the lack of real-time processing, the fact that it is not always easy to implement, the shuffling of data, and its orientation toward batch processing.

Map Reduce Components:

1. Name Node: manages HDFS metadata; doesn't deal with files directly.
2. Data Node: stores blocks of HDFS; default replication level for each block: 3.
3. Job Tracker: schedules, allocates, and monitors job execution on the slaves (Task Trackers).
4. Task Tracker: runs MapReduce operations.

Map Reduce Framework

MapReduce is a software framework for distributed processing of large datasets on computer clusters. It was first developed by Google and is intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. A typical Hadoop cluster integrates the MapReduce and HDFS layers. In the MapReduce layer, the job tracker assigns tasks to the task trackers; the master node's job tracker also assigns tasks to the slave nodes' task trackers (Fig. 8).

Fig. 8 MapReduce is based on the master-slave architecture


The master node contains:
Job tracker node (MapReduce layer)
Task tracker node (MapReduce layer)
Name node (HDFS layer)
Data node (HDFS layer)

Each of the multiple slave nodes contains:
Task tracker node (MapReduce layer)
Data node (HDFS layer)

The MapReduce layer thus has the job and task tracker nodes, while the HDFS layer has the name and data nodes.

C. Hive:

Hive is a distributed agent platform, a decentralized system for building applications by networking local system resources. The Apache Hive data warehousing component, an element of the cloud-based Hadoop ecosystem, offers a query language called HiveQL that translates SQL-like queries into MapReduce jobs automatically. Applications related to Apache Hive include SQL, Oracle, and IBM DB2. The architecture is divided into MapReduce-oriented execution, metadata information for data storage, and an execution part that receives a query from a user or application for execution.

Fig. 9 Architecture of HIVE

The advantages of Hive are that it is more secure and its implementations are good and well tuned. The disadvantages of Hive are that it is suited only to ad hoc queries and its performance is lower compared to Pig.
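To illustrate how HiveQL lets users issue SQL-like queries that Hive compiles into MapReduce jobs, the sketch below submits one query through Hive's standard JDBC interface (HiveServer2). The host, port, database, table, and column names are illustrative assumptions, not taken from the paper.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // A HiveQL aggregation; Hive translates it into one or more MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

The GROUP BY aggregation is exactly the kind of query that Hive turns into a map phase (emit page keys) followed by a reduce phase (sum the hits per page).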

D. No-SQL:

A NoSQL database is an approach to data management and data design that is useful for very large sets of distributed data. These databases are generally part of real-time event detection processes deployed on inbound channels, but they can also be seen as an enabling technology for analytical capabilities such as relative search applications. Such applications are made feasible only by the elastic nature of the NoSQL model, where the dimensionality of a query evolves from the data in scope and its domain rather than being fixed by the developer in advance.

NoSQL is useful when an enterprise needs to access huge amounts of unstructured data. There are more than one hundred NoSQL approaches that specialize in managing different multimodal data types (from structured to non-structured), each aiming to solve very specific challenges. Data scientists, researchers, and business analysts in particular favour this agile approach, which yields earlier insights into datasets that might remain concealed or constrained under a more formal development process. The most popular NoSQL database is Apache Cassandra.

The advantages of NoSQL are that it is open source, horizontally scalable, easy to use, able to store complex data types, and very fast for adding new data and for simple operations/queries. The disadvantages of NoSQL are immaturity, no indexing support, no ACID, complex consistency models, and absence of standardization.
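As a concrete example of the NoSQL access pattern described above, the following sketch uses the DataStax Java driver (3.x API) for Apache Cassandra, which the paper names as the most popular NoSQL database. The contact point, keyspace, table, and sample row are illustrative assumptions.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraSketch {
    public static void main(String[] args) {
        // Contact point and keyspace are assumptions for illustration.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class':'SimpleStrategy','replication_factor':1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.events "
                    + "(id uuid PRIMARY KEY, source text, payload text)");

            // Fast insert of a new record, the workload NoSQL stores are tuned for.
            session.execute("INSERT INTO demo.events (id, source, payload) "
                    + "VALUES (uuid(), 'sensor-1', 'temperature=21.5')");

            ResultSet rs = session.execute("SELECT source, payload FROM demo.events LIMIT 10");
            for (Row row : rs) {
                System.out.println(row.getString("source") + " -> " + row.getString("payload"));
            }
        }
    }
}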


Fig. 10 Architecture of NoSQL

E. HPCC:

HPCC is an open-source computing platform that provides services for handling massive big data workflows. The HPCC data model is defined by the user according to the requirements. The HPCC system was proposed and then designed to manage the most complex and data-intensive analytical problems. It is a single platform with a single architecture and a single programming language used for the data simulation. The HPCC system was designed to analyze gigantic amounts of data for the purpose of solving complex big data problems. It is based on Enterprise Control Language (ECL), a declarative, non-procedural programming language. The main components of HPCC are:

HPCC Data Refinery: mostly uses a parallel ETL engine.
HPCC Data Delivery: largely based on a structured query engine.
Enterprise Control Language distributes the workload between the nodes in an appropriately even load.

IV. EXPERIMENTS ANALYSIS

Big-Data System Architecture:

In this section, we focus on the value chain for big data analytics. Specifically, we describe a big data value chain that consists of four stages (generation, acquisition, storage, and processing). Next, we present a big data technology map that associates the leading technologies in this domain with specific phases in the big data value chain and a time stamp.

Big-Data System: A Value-Chain View

A big-data system is complex, providing functions to deal with different phases in the digital data life cycle, ranging from its birth to its destruction. At the same time, the system usually involves multiple distinct phases for different applications. In this case, we adopt a systems-engineering approach, well accepted in industry, to decompose a typical big-data system into four consecutive phases, including data generation, data acquisition, data storage, and data analytics, as illustrated on the horizontal axis of the technology map in Fig. 11. Notice that data visualization is an assistive method for data analysis: in general, one visualizes data to find some rough patterns first, and then employs specific data mining methods; we mention this in the data analytics section.

The details for each phase are explained as follows.

Data generation concerns how data are generated. In this case, the term ``big data'' is designated to mean large, diverse, and complex datasets that are generated from various longitudinal and/or distributed data sources, including sensors, video, click streams, and other available digital sources. Normally, these datasets are associated with different levels of domain-specific values. In this paper, we focus on datasets from three prominent domains, business, Internet, and scientific research, for which values are relatively easy to understand. However, there are overwhelming technical challenges in collecting, processing, and analyzing these datasets that demand new solutions to embrace the latest advances in the information and communications technology (ICT) domain.

Data acquisition refers to the process of obtaining information and is subdivided into data collection, data transmission, and data pre-processing. First, because data may come from a diverse set of sources (e.g., websites that host formatted text, images, and/or videos), data collection refers to dedicated data collection technology that acquires raw data from a specific data production environment. Second, after collecting raw data, we need a high-speed transmission mechanism to transmit the data into the proper storage sustaining system for various types of analytical applications.


Fig. 11 Big data technology map. It pivots on two axes, i.e., data value chain and timeline. The data value chain divides the data lifecycle into four stages, including data generation, data acquisition, data storage, and data analytics. In each stage, we highlight exemplary technologies over the past 10 years.

Finally, collected datasets might contain much meaningless data, which unnecessarily increases the amount of storage space and affects the consequent data analysis. For instance, redundancy is common in most datasets collected from sensors deployed to monitor the environment, and we can use data compression technology to address this issue. Thus, we must perform data pre-processing operations for efficient storage and mining.
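As a small, self-contained illustration of how redundancy makes such sensor datasets highly compressible, the sketch below gzips a run of repetitive readings with the standard java.util.zip API. The readings are invented for the example; the point is only that redundancy reduction before storage can shrink the data dramatically.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class SensorCompressionSketch {
    public static void main(String[] args) throws IOException {
        // Highly redundant readings, typical of an environmental sensor that rarely changes.
        StringBuilder readings = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            readings.append("sensor-1,temperature,21.5\n");
        }
        byte[] raw = readings.toString().getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(raw);
        }

        System.out.printf("raw: %d bytes, compressed: %d bytes%n",
                raw.length, compressed.size());
    }
}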

Data storage concerns persistently storing and managing large-scale datasets. A data storage system can be divided into two parts: hardware infrastructure and data management. Hardware infrastructure consists of a pool of shared ICT resources organized in an elastic way for various tasks in response to their instantaneous demand. The hardware infrastructure should be able to scale up and out and be able to be dynamically reconfigured to address different types of application environments. Data management software is deployed on top of the hardware infrastructure to maintain large-scale datasets. Additionally, to analyze or interact with the stored data, storage systems must provide several interface functions, fast querying, and other programming models.

Data analysis leverages analytical methods or tools to inspect, transform, and model data to extract value. Many application fields leverage opportunities presented by abundant data and domain-specific analytical methods to derive the intended impact. Although various fields pose different application requirements and data characteristics, a few of these fields may leverage similar underlying technologies. Emerging analytics research can be classified into six critical technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics.

Big-Data System: A Layered View:

Alternatively, the big data system can be decomposed into a layered structure, as illustrated in Fig. 12.


Fig. 12 Layered architecture of a big data system. It can be decomposed into three layers, including the infrastructure layer, computing layer, and application layer, from bottom to top.

The layered structure is divisible into three layers, i.e., the infrastructure layer, the computing layer, and the

application layer, from bottom to top. This layered view only provides a conceptual hierarchy to underscore the

complexity of a big data system. The function of each layer is as follows.

The infrastructure layer consists of a pool of ICT resources, which can be organized by cloud computing infrastructure and enabled by virtualization technology. These resources will be exposed to upper-layer systems in a fine-grained manner with a specific service-level agreement (SLA). Within this model, resources must be allocated to meet the big data demand while achieving resource efficiency by maximizing system utilization, energy awareness, operational simplification, etc.

The computing layer encapsulates various data tools into a middleware layer that runs over raw ICT resources. In the context of big data, typical tools include data integration, data management, and the programming model. Data integration means acquiring data from disparate sources and integrating the dataset into a unified form with the necessary data pre-processing operations. Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model implements abstraction of the application logic and facilitates data analysis applications. MapReduce, Dryad, Pregel, and Dremel exemplify programming models.

The application layer exploits the interface provided by the programming models to implement various data analysis functions, including querying, statistical analyses, clustering, and classification; then, it combines basic analytical methods to develop various field-related applications. McKinsey presented five potential big data application domains: health care, public sector administration, retail, global manufacturing, and personal location data.

Big-Data System Challenges:

Designing and deploying a big data analytics system is not a trivial or straightforward task. As one of its definitions suggests, big data is beyond the capability of current hardware and software platforms. The new hardware and software platforms in turn demand new infrastructure and models to address the wide range of challenges of big data. Recent works have discussed potential obstacles to the growth of big data applications. In this paper, we strive to classify these challenges into three categories: data collection and management, data analytics, and system issues.

Data collection and management addresses massive amounts of heterogeneous and complex data. The following challenges of big data must be met:

Data Representation: Many datasets are heterogeneous in type, structure, semantics, organization, granularity, and accessibility. A competent data representation should be designed to reflect the structure, hierarchy, and diversity of the data, and an integration technique should be designed to enable efficient operations across different datasets.

Redundancy Reduction and Data Compression: Typically, there is a large amount of redundant data in raw datasets. Redundancy reduction and data compression without sacrificing potential value are efficient ways to lessen overall system overhead.


Data Life-Cycle Management: Pervasive sensing and computing are generating data at an unprecedented rate and scale that exceeds the much smaller advances in storage system technologies. One of the urgent challenges is that current storage systems cannot host the massive data. In general, the value concealed in big data depends on data freshness; therefore, we should set up a data importance principle associated with the analysis value to decide which parts of the data should be archived and which parts should be discarded.

Data Privacy and Security: With the proliferation of online services and mobile phones, privacy and security concerns regarding accessing and analyzing personal information are growing. It is critical to understand what support for privacy must be provided at the platform level to eliminate privacy leakage and to facilitate various analyses.

There will be a significant impact that results from advances in big data analytics, including interpretation, modeling, prediction, and simulation. Unfortunately, massive amounts of data, heterogeneous data structures, and diverse applications present tremendous challenges, such as the following.

Approximate Analytics: As datasets grow and the real-time requirement becomes stricter, analysis of the entire dataset is becoming more difficult. One way to potentially solve this problem is to provide approximate results, such as by means of an approximation query (a small sampling sketch follows this list). The notion of approximation has two dimensions: the accuracy of the result and the groups omitted from the output.

Connecting Social Media: Social media possesses unique properties, such as vastness, statistical redundancy, and the availability of user feedback. Various extraction techniques have been successfully used to identify references from social media to specific product names, locations, or people on websites. By connecting inter-field data with social media, applications can achieve high levels of precision and distinct points of view.

Deep Analytics: One of the drivers of excitement around big data is the expectation of gaining novel insights. Sophisticated analytical technologies, such as machine learning, are necessary to unlock such insights. However, effectively leveraging these analysis toolkits requires an understanding of probability and statistics. The potential pillars of privacy and security mechanisms are mandatory access control and secure communication, multi-granularity access control, privacy-aware data mining and analysis, and secure storage and management.
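To make the Approximate Analytics point above concrete, the sketch below computes a statistic over a fixed-size uniform random sample of a stream (reservoir sampling, Algorithm R) instead of scanning the entire dataset. The stream source and sample size are illustrative assumptions; the technique is a generic one, not a method proposed in this paper.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

public class ReservoirSamplingSketch {

    // Keep a uniform random sample of at most k items from a stream of unknown length.
    static List<Integer> reservoirSample(IntStream stream, int k, Random rnd) {
        List<Integer> sample = new ArrayList<>(k);
        int[] seen = {0};
        stream.forEach(value -> {
            seen[0]++;
            if (sample.size() < k) {
                sample.add(value);
            } else {
                int j = rnd.nextInt(seen[0]);  // position in [0, seen)
                if (j < k) {
                    sample.set(j, value);      // keep item with probability k/seen
                }
            }
        });
        return sample;
    }

    public static void main(String[] args) {
        // Simulated stream of one million readings; in practice this would be a live data stream.
        IntStream readings = new Random(42).ints(1_000_000, 0, 100);
        List<Integer> sample = reservoirSample(readings, 1_000, new Random(7));

        double approxMean = sample.stream().mapToInt(Integer::intValue).average().orElse(0);
        System.out.println("approximate mean from 1,000-element sample: " + approxMean);
    }
}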

Finally, large-scale parallel systems generally confront several common issues; however, the emergence of big data has amplified the following challenges in particular.

Energy Management: The energy consumption of large-scale computing systems has attracted greater concern

from economic and environmental perspectives. Data transmission, storage, and processing will inevitably consume

progressively more energy, as data volume and analytics demand increases. Therefore, system-level power control and

management mechanisms must be considered in a big data system, while continuing to provide extensibility and

accessibility.

Scalability: A big data analytics system must be able to support very large datasets created now and in the future. All the components in big data systems must be capable of scaling to address the ever-growing size of complex datasets.

Collaboration: Big data analytics is an interdisciplinary research field that requires specialists from multiple professional fields collaborating to mine hidden values. A comprehensive big data cyber infrastructure is necessary to allow broad communities of scientists and engineers to access the diverse data, apply their respective expertise, and cooperate to accomplish the goals of analysis.

V. CONCLUSIONS

In this paper, we have surveyed various technologies for handling big data and their architectures. We have also discussed the challenges of big data (volume, variety, velocity, value, veracity) and the advantages and disadvantages of these technologies. The paper discussed an architecture using Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers. The main goal of our paper was to survey various big data handling techniques that handle massive amounts of data from different sources and improve the overall performance of systems.

REFERENCES

[1] Yuri Demchenko, "The Big Data Architecture Framework (BDAF)," outcome of the brainstorming session at the University of Amsterdam, 17 July 2013.
[2] Amogh Pramod Kulkarni and Mahesh Khandewal, "Survey on Hadoop and Introduction to YARN," International Journal of Emerging Technology and Advanced Engineering (ISSN 2250-2459), Vol. 4, Issue 5, May 2014.
[3] S. Sagiroglu and D. Sinanc, "Big Data: A Review," 2013, pp. 20-24.
[4] Vibhavari Chavan and Rajesh N. Phursule, "Survey Paper on Big Data," International Journal of Computer Science and Information Technologies, Vol. 5 (6), 2014.
[5] Margaret Rouse, "Unstructured Data," April 2010.
[6] Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis," DNIS 2013, LNCS 7813, pp. 44-48, 2013.
[7] X. L. Dong and D. Srivastava, "Big Data Integration," IEEE International Conference on Data Engineering (ICDE), 29 (2013), pp. 1245-1248.


[8] F. Tekiner and J. A. Keane, "Big Data Framework," 2013 IEEE International Conference on Systems, Man and Cybernetics (SMC), 13-16 Oct. 2013, pp. 1494-1499.
[9] Mrigank Mridul, Akashdeep Khajuria, Snehasish Dutta, and Kumar N, "Analysis of Big Data using Apache Hadoop and Map Reduce," Vol. 4, Issue 5, May 2014.
[10] Suman Arora and Madhu Goel, "Survey Paper on Scheduling in Hadoop," International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 4, Issue 5, May 2014.
[11] Aditya B. Patel, Manashvi Birla, and Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce," in Proc. 2012 Nirma University International Conference on Engineering.
[12] Jimmy Lin, "MapReduce Is Good Enough?" The Control Project, IEEE Computer, 32 (2013).