International Journal of Software Engineering and Its Applications
Vol. 10, No. 6 (2016), pp. 13-30
http://dx.doi.org/10.14257/ijseia.2016.10.6.02
ISSN: 1738-9984 IJSEIA
Copyright ⓒ 2016 SERSC

Hadoop based Weblog Analysis: A Review

Pooja D. Savant1, Debnath Bhattacharyya2, Tai-hoon Kim3

1Dept of Information Technology, Bharati Vidyapeeth Deemed University College of Engineering, Pune-411043, India
2Department of Computer Science and Engineering, Vignan’s Institute of Information Technology, Visakhapatnam-530049, India
3Department of Convergence Security, Sungshin Women's University, 249-1, Dongseon-dong 3-ga, Seoul, 136-742, Korea (Corresponding Author)
[email protected], [email protected], [email protected]

Abstract

The growth of websites and the Internet has opened up new research, social, entertainment, education and business opportunities. With the fast growth of the Internet, the digital data generated by websites has become so massive that traditional text-processing software and relational database technology face a bottleneck when processing it, and the results they produce are not satisfactory. Cloud computing offers a good solution to this problem: it can not only store such massive data but also process and analyze it faster, by making use of distributed storage and distributed computing technology. A weblog is a group of connected web pages containing a log or daily record of information, particular fields or views, which is updated from time to time by the site owner, other websites or the site's users. An enterprise weblog analysis system based on the Hadoop architecture, with the Hadoop Distributed File System (HDFS), the Hadoop MapReduce software framework and the Pig Latin language, aids the business decision-making of system administrators and helps them collect and identify the potential value hidden within the huge data generated by websites. Such a weblog analysis covers an Internet site's access log and provides information about the number of visitors, the days of the week and rush hours, views, hits, frequently accessed pages, application-server traffic trends, performance reports at varying intervals, and statistical reports that indicate the performance of the system.

Keywords: Cloud computing, Hadoop Framework, Distributed File System, MapReduce Software Framework, Pig Latin Language

1. Introduction

The fast development of the Internet has increased its use in people's daily life, which in turn has accelerated the growth of weblogs. This poses several problems: the weblogs must be handled in a timely manner, and the information people need must be extracted from the massive logs generated. A single computer cannot handle weblogs satisfactorily to meet these needs. A Hadoop-based weblog analysis system combines cloud computing and Hadoop technology, processes all the gathered logs and carries out distributed parallel analysis, which solves the difficulties of conventional setups such as handling and gathering data at the same time. Such a system also relieves the bottlenecks in computing power and storage volume, saves most of the processing time and greatly improves effectiveness.

Various pitfalls observed in previous weblog analysis systems are as follows:

I. The growth of the Internet has resulted in a considerable rise in the amount of logs, due to which the log-processing speed of existing systems is extremely low.

II. The storage size is inadequate for the huge logs, because of the high daily increase in the quantity of logs and the fixed hard-disk size of the server [4].

III. It is not possible to collect and analyze the weblogs at the same time.

A Hadoop-based weblog analysis system employs a Hadoop cluster based on a cloud environment. It can satisfactorily solve the problems stated above, such as a low processing rate and inadequate storage volume, and it can gather and analyze the weblogs simultaneously.

A Hadoop-based weblog analysis system is designed and implemented on the Hadoop architecture with the Hadoop Distributed File System (HDFS), the Hadoop MapReduce software framework and the Pig Latin language.

In the sections below, we briefly discuss the historical development of Hadoop-based weblog analysis and the various steps required to perform it.

1.1. History of Weblog Analysis using Hadoop

With the quick development of Internet technology in recent years, digital data is showing explosive growth, and mobile devices and cloud computing environments are generating huge data logs. Traditional text-processing software and relational database technology have been facing a bottleneck, and the results they present are not satisfactory. It is an acknowledged and ongoing problem that existing systems are incapable of handling these huge logs and consequently fail to find the hidden trends within them [1].

In today's Internet world, log-file analysis is becoming essential for studying customer behavior in order to boost promotions and sales. To analyze log data, we extract the required knowledge from it with the help of web mining. Log files are produced extremely quickly, at a rate of roughly 1 to 10 MB/s per device; in one day, a single data center can produce tens of terabytes of log data. Because the data sets are so large, we need a parallel processing system as well as a dependable data storage mechanism to analyze them. The Hadoop framework provides dependable data storage through the Hadoop Distributed File System, together with the MapReduce computing model that acts as a parallel processing engine over huge data sets. HDFS splits the initial data and distributes the fragments across the computers of an HDFS cluster. Because the machines of the Hadoop cluster each hold blocks of the data, the log data can be processed in parallel and the result evaluated efficiently. The preferred Hadoop methodology is to "store first and query later": Hadoop first puts all the data onto HDFS and then executes queries, written in the Pig Latin language, which reduces the response time and the load on the user's setup [2]. Pig is designed to blend the declarative style of SQL with the procedure-oriented map-reduce style, which on its own is as inflexible as machine code or assembly language and leads to a large amount of custom user code that is difficult to maintain and reuse. When a script is executed, Pig compiles Pig Latin into physical plans that run on Hadoop, an open-source implementation of map-reduce. Pig itself is an open-source Apache incubator project, freely available for public use. Pig drastically reduces the time needed to develop and run data-analysis tasks compared with using Hadoop alone, and integrating Pig with Hadoop also provides a new debugging environment, resulting in very high efficiency and increased profitability [3].

1.2. Various Steps to Perform a Weblog Analysis using Hadoop

The various steps to perform a weblog analysis using Hadoop are as follows:

1.2.1. Data Pre-processing: All the records to be processed are dispersed across different application servers. The weblogs spread across the various application servers undergo preliminary collation over FTP on Linux and are merged into a single log file [1].

1.2.2. Upload: After the file is split, blocks of the designated size are uploaded into the specified HDFS directories through various commands [1].

1.2.3. Hadoop Processing: The data is not changed once the weblog file is stored in HDFS. The system functions are divided into two categories, batch analysis and interactive input conditions. The main purpose of batch analysis is to estimate traffic statistics and conduct multi-dimensional analysis; the main purpose of the interactive functions is to analyze a specific business and a specific range [1].

The system administrator develops a basic program structure consisting of a shell-script frame:

(i) to receive different parameters according to the functional needs, and

(ii) to call the Pig script which loads the log and copies the analysis result to be stored across various Linux directories [1].

1.2.4. Analysis: The system administrator creates multi-dimensional analysis reports from the generated results, such as sales distribution and application-server flow and data [1]. A simplified sketch of the upload step of this workflow is given below.
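As a concrete illustration of the upload step (Section 1.2.2), the following Java sketch copies a collated log file from the local file system into an HDFS directory using the Hadoop FileSystem API. It is a minimal sketch under stated assumptions, not the system described in [1]; the cluster address, paths and file names are illustrative placeholders.

// Minimal sketch of the "Upload" step, assuming a collated log file already
// exists locally; the fs.defaultFS URI and paths are illustrative, not values
// taken from the reviewed system.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WeblogUploader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // cluster address (assumed)

        try (FileSystem fs = FileSystem.get(conf)) {
            Path local = new Path("/var/log/collated/access_2016-06-01.log"); // assumed local file
            Path target = new Path("/weblogs/raw/");                          // assumed HDFS directory

            fs.mkdirs(target);                   // create the target directory if it is absent
            fs.copyFromLocalFile(local, target); // upload the collated log into HDFS
            System.out.println("Uploaded " + local + " to " + target);
        }
    }
}

Once the file is in HDFS, the batch or interactive analysis of Section 1.2.3 can be started by a shell-script frame or an embedded Pig program, as sketched later in Section 2.5.2.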

2. Literature Review

In this section, we identify the various indicators derived through weblog analysis. We also discuss the Hadoop framework, the Hadoop Distributed File System (HDFS), the MapReduce software framework and the Pig programming language.

2.1. Various Indicators derived through Weblog Analysis

When we analyze a weblog, a server log file produced by an Internet server is processed. Depending on the values present in the log file, the most prevalent indicators that can be obtained are: the number of site accesses and the number of distinct visitors; the duration of a visit and the last visit; authenticated users and their last authenticated access; the days of the week and the hours at which site traffic is heaviest; the domains or countries of the visiting hosts and the list of hosts; the number of page views and the most viewed, entry and exit pages; the file types used; the operating systems, browsers and robots used; the HTTP referrer and the search engines, key phrases and keywords used to reach the analyzed site; the HTTP errors encountered; who is currently visiting the site; conversion tracking; and visit time and page navigation. Together these indicators enable us to analyze when, how and by whom an Internet server is accessed.

Results can be produced from the log files promptly, or reports can be generated on demand from a database containing the log files.

2.2. Hadoop

Hadoop is an open-source platform used to store huge volumes of data and to extract accurate, deep understanding from them. The Hadoop project is developed by the Apache Software Foundation, which produces open-source software for reliable, scalable, distributed computing.

Apache Hadoop is a software framework that allows distributed processing of huge data sets across clusters of machines using simple programming models. It is designed to scale from a single server to a large number of computers, each offering local computation and storage. Rather than relying on the hardware to deliver high availability, the Hadoop library is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of machines, each of which may fail.

The project contains the following modules:

a) Hadoop Distributed File System (HDFS)

b) Hadoop MapReduce

c) Pig

2.3. Hadoop Distributed File System (HDFS)

HDFS provides high-throughput access to application data. The Hadoop file system, which follows a distributed file system (DFS) design, runs on affordable, commodity hardware. HDFS differs from other distributed file systems in its high fault tolerance and its design for low-cost hardware.

HDFS holds very large amounts of data and provides a simple access mechanism. To store such a large amount of data, files are stored across multiple machines, and they are stored redundantly to protect the system against possible data loss on failure. HDFS also makes applications available for parallel processing.

2.3.1. HDFS Architecture: HDFS follows a master-slave architecture and consists of the elements shown in Figure 1.


Figure 1. Architecture of HDFS

Name Node: The Name Node holds the metadata for HDFS. It stores and maintains information about the positions of files in HDFS; this information is needed to fetch data spread across the many computers of the Hadoop cluster. The Name Node is software that can be run on affordable, commodity hardware.

The Name Node functions as the master server and performs the following jobs:

a) Management of the file system namespace.

b) Regulation of the clients' access to files.

c) Execution of file and directory operations such as rename, close and open.

Data Node: A Data Node is a piece of affordable, commodity hardware that stores HDFS data. There is a Data Node for each node of the Hadoop cluster. The Data Nodes manage the data storage of HDFS: they handle the blocks that contain parts of files at a node, report the file and block information stored on a particular node, and respond to the Name Node for all file system operations. Data Nodes execute read and write requests on the file system according to client demand, and they create, delete and replicate blocks on command from the Name Node. A small sketch of how a client can see this block-to-Data-Node mapping is given below.
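The following Java sketch shows how an HDFS client could ask, through metadata served by the Name Node, which Data Nodes hold the blocks of a file. It is a hedged illustration of the architecture just described; the file path is an assumption.

// List the block layout of an HDFS file and the hosts (Data Nodes) that store
// each block. The path is an illustrative placeholder.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLister {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/weblogs/raw/access.log")); // assumed path

        // Ask for the block layout of the whole file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", stored on " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}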

Block: Client data is stored inside HDFS files. A file in the file system is split into one or more segments and/or stored over distinct Data Nodes; each of these file segments is called a block. A block is the minimum amount of data that HDFS reads or writes. The default block size is 64 MB, and it can be increased by changing the HDFS configuration, as in the sketch below.
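The following Java sketch illustrates how a block size other than the default could be requested when writing a file, using the FileSystem.create overload that accepts an explicit block size. The 128 MB value, replication factor and path are illustrative assumptions, and newer Hadoop releases ship with a larger default block size than the 64 MB mentioned above.

// Write a file with an explicit block size and replication factor.
// Values and paths are assumptions for illustration only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long blockSize = 128L * 1024 * 1024; // request 128 MB blocks instead of the default
        short replication = 3;               // keep three replicas of each block
        int bufferSize = 4096;

        // Create the file with the chosen block size and replication factor.
        try (FSDataOutputStream out =
                 fs.create(new Path("/weblogs/raw/big.log"), true, bufferSize, replication, blockSize)) {
            out.writeBytes("sample log line\n");
        }
        fs.close();
    }
}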

2.3.2. Features of HDFS: Various features of HDFS are given below:

a) Hadoop, together with HDFS, is well suited to distributed storage and distributed processing on affordable, commodity hardware. Both are fault tolerant, scalable and very easy to expand.

b) HDFS is highly configurable. Its default configuration is suitable for most installations; tuning is required only for very large clusters.

c) HDFS periodically adds new features and improvements. Some useful features include: file permissions and authentication; rack awareness, which takes the physical placement of a node into account when scheduling tasks and allocating storage; Safemode, an administrative mode for maintenance; fsck, a utility that diagnoses the health of the file system and finds missing files or blocks; fetchdt, a utility that fetches a DelegationToken and stores it in a file on the local system; a Rebalancer tool that rebalances the cluster when data is not distributed evenly across the Data Nodes; upgrade and rollback, which makes it possible to return to the HDFS state prior to a software upgrade if unforeseen problems arise; and a Secondary Name Node, which performs periodic checkpoints of the namespace and keeps the size of the file containing the HDFS log of changes within limits on the Name Node.

2.4. Hadoop MapReduce

Hadoop MapReduce is a YARN-based system for the parallel processing of huge data sets. As described in Figure 2, MapReduce is a framework that helps us write applications which process large quantities of data in parallel, across large clusters of affordable, commodity hardware, in a reliable manner. MapReduce is both a processing technique and a programming model for Java-based distributed processing. A MapReduce procedure consists of two significant activities, Map and Reduce. Map takes a collection of data and transforms it into a different collection in which the individual elements are split into tuples (key/value pairs). The reduce stage then takes the output of the map as its input and combines those key/value pairs into a smaller collection of key/value pairs. As the name MapReduce indicates, the reduce stage is always executed after the map stage. A major benefit of MapReduce is that it is simple to scale data processing across many processing nodes. Under the MapReduce model, the data-processing primitives are called mappers and reducers. Decomposing a data-processing application into mappers and reducers is not always trivial, but once a program is written in the MapReduce form, scaling it to run over a very large number of machines in a cluster requires merely a configuration change. This simple scalability is what attracts many programmers to the MapReduce model.


Figure 2. Architecture of MapReduce Software Framework

2.4.1. The MapReduce Algorithm: The MapReduce paradigm is based on sending the logic of the map and reduce code to every Data Node of the cluster. A MapReduce program is executed in three phases: mapping, shuffling and, lastly, reducing.

Map phase: The map function operates on the initial data to be processed. This raw input data is stored in the Hadoop Distributed File System (HDFS) as files or directories, and the input file is passed to the map function line by line. The map function processes the data and generates many small data fragments.

Reduce phase: This phase combines the shuffle step with the reduce step. The reducer processes the data that comes from the mapper; once the required operations are performed, a new set of output is produced and stored in HDFS.

The MapReduce operation is described in Figure 3. When a MapReduce job executes, Hadoop sends the map and reduce tasks to the appropriate servers of the Hadoop cluster. The Hadoop platform handles all the details of data passing, such as assigning tasks, verifying their completion and copying data between nodes of the cluster. Most of the computation happens on the nodes whose local disks hold the data, which reduces network congestion. Once the tasks are done, the cluster gathers and reduces the data into a suitable result and returns it to the Hadoop server. A small map/reduce sketch for the weblog example follows Figure 3.


Figure 3. Map-Reduce Operation
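To make the two phases concrete for the weblog example, the following Java sketch shows a mapper that emits a <URL, 1> pair for every log line and a reducer that sums the counts per URL. It is a hedged sketch, not code from the reviewed system; the assumption that the requested URL is the seventh space-separated field (a common access-log layout) is purely illustrative.

// Map phase: emit <URL, 1> per request line. Reduce phase: sum hits per URL.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UrlHitCount {

    public static class UrlMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(" ");
            if (fields.length > 6) {      // skip malformed lines
                url.set(fields[6]);       // requested URL (assumed field position)
                context.write(url, ONE);  // emit <URL, 1>
            }
        }
    }

    public static class UrlReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text url, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();         // sum all 1s for this URL
            }
            context.write(url, new IntWritable(total)); // emit <URL, total hits>
        }
    }
}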

2.4.2. Inputs and Outputs (Java Perspective): The MapReduce framework operates on <key, value> pairs: it views the input to a job as a set of <key, value> pairs and produces a set of <key, value> pairs as the job output, possibly of different types.

The key and value classes must be serializable by the Hadoop framework, which they achieve by implementing the Writable interface. In addition, the key classes must implement the WritableComparable interface to make sorting easier for the framework. The input and output types of a MapReduce job are: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output). A driver sketch that wires these types together is shown below.
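The following driver sketch shows how these <key, value> types map onto a Job configuration: <LongWritable, Text> into the map, <Text, IntWritable> out of the map and out of the reduce. It reuses the hypothetical UrlHitCount mapper and reducer sketched above; the input and output paths are placeholders.

// Job driver wiring the Writable key/value types of the weblog example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlHitCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "weblog url hit count");
        job.setJarByClass(UrlHitCountDriver.class);

        job.setMapperClass(UrlHitCount.UrlMapper.class);
        job.setReducerClass(UrlHitCount.UrlReducer.class);

        // Both the map output and the final output use Writable/WritableComparable types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/weblogs/raw"));      // assumed input directory
        FileOutputFormat.setOutputPath(job, new Path("/weblogs/counts")); // assumed output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}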

2.4.3. Applications of MapReduce: The various applications of MapReduce are described below.

Geospatial Query Processing: As location-based services advance, a large increase is seen in the quantity of geospatial data. Geospatial queries (nearest-neighbor queries and reverse nearest-neighbor queries) consume a large amount of computation resources, but processing geospatial data is inherently parallel [6].

Google Maps uses MapReduce to solve problems such as the following:

i. For a specific intersection, identify all the roads connecting to it.

ii. Render the tiles of the map.

iii. For a specific address or current location, identify the nearest feature to it.

If we consider a search in Google Maps for the gas stations near Arizona State University, the input to a MapReduce program would be a graph representing a road network with all the gas stations marked. The map function would search within a specific radius (say 5 miles) of every gas station and mark the distance to every node. After sorting by key, the reduce function would emit, for every node, the path and the nearest gas station. The final output is a graph annotated with the gas station nearest to every node, as described in Figure 4.


Figure 4. Search in Google maps using Map-Reduce

MapReduce for Gridding LIDAR Data: Digital Elevation Models (DEMs) are digital or 3D depictions of the landscape, in which every (X, Y) location is represented by a distinct elevation value. DEMs are also referred to as Digital Terrain Models (DTM) or Digital Surface Models (DSM). A DEM can be represented as a lattice of squares or as a triangular irregular network (TIN), and it can be created by remote sensing from satellites or from precisely surveyed land-elevation information. DEMs are used in a variety of scientific and engineering applications such as hydrologic modeling, terrain analysis and infrastructure planning. The fundamental processing job in the OpenTopography system is the production of DEMs from extremely dense (multiple measurements per square meter) LIDAR (Light Detection and Ranging) topography data. The use of MapReduce for gridding LIDAR data is described in Figure 5.

Figure 5. Use of Map-Reduce for Gridding LIDAR Data

The local gridding method uses the elevation details of the LIDAR measurements contained within a circular search area to calculate the elevation of every lattice cell. This computation is performed over the entire data set in parallel, making it a perfect use case for MapReduce.

During the Map stage, the input points are allocated to their respective lattice cells (local bins), and during the Reduce stage the elevation of every lattice cell is calculated from its local bin. The reduced results are merged and sorted, and the actual DEM is produced in the ArcASCII lattice form, known as the ESRI ASCII grid format. The MapReduce implementation of the local gridding method in Hadoop is depicted in Figure 6.

Figure 6. Map-Reduce Implementation of Local Gridding Method in Hadoop

Count of URL Access Frequency: The map function processes logs of web-page requests and outputs <URL, 1> pairs. The reduce function adds together all the values for the same URL and emits a <URL, total count> pair [7].

Reverse Web-Link Graph: The map function outputs a <target, source> pair for every link to a target URL found in a page called "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)> [7].

Term Vector per Host: A term vector summarizes the most significant words that occur in a document or a set of documents, as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for every input document (where the hostname is extracted from the document URL). The reduce function is passed all per-document term vectors for a given host; it adds these term vectors together, discards infrequent terms, and then emits a final <hostname, term vector> pair [7].

Inverted Index: The map function parses every document and emits a sequence of <word, document ID> pairs. The reduce function takes all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index, and it is easy to augment this computation to keep track of word positions as well [7]. A minimal sketch of this pattern is given below.
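The following Java sketch illustrates the inverted-index pattern described in [7]. Treating the name of the input file as the document ID is an assumption made purely for illustration.

// Map: emit <word, documentID>. Reduce: gather and sort the IDs per word.
import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Use the source file name as a stand-in for the document ID (assumption).
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String word : line.toString().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new Text(docId)); // emit <word, docID>
                }
            }
        }
    }

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> sorted = new TreeSet<>();   // sort and de-duplicate document IDs
            for (Text id : docIds) {
                sorted.add(id.toString());
            }
            context.write(word, new Text(String.join(",", sorted))); // <word, list(docID)>
        }
    }
}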

2.5. Pig Programming Language

Pig is a high-level data-flow language together with an execution framework for parallel computation. Apache Pig provides an abstraction over MapReduce; it is a tool or platform used to analyze huge data sets by representing them as data flows. Pig is used alongside Hadoop, and all the data-manipulation operations in Hadoop can be carried out with Apache Pig. Pig offers a language, closer to human languages, called Pig Latin, which lets us write data-analysis programs independently of a particular type of machine. The language provides various operators with which programmers can develop their own functions for reading, writing and processing data. To analyze data with Apache Pig, programmers write scripts in the Pig Latin language; all these scripts are internally converted into Map and Reduce tasks. Apache Pig has a component called the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.

2.5.1. Need for Apache Pig: Programmers who are not proficient in Java generally struggled while working with Hadoop, particularly when writing MapReduce tasks. Apache Pig is a boon for all such programmers. Using Pig Latin, programmers can run MapReduce tasks easily without having to type complex Java code. Apache Pig uses a multi-query approach, which shortens the code: for example, an operation that would require 200 lines of code (LoC) in Java can be accomplished with as few as 10 LoC in Apache Pig, reducing development time by almost 16 times. Pig Latin is similar to SQL, so it is easy to learn Apache Pig for anyone acquainted with SQL. Apache Pig provides many built-in operators that support data operations such as joins, filters and ordering, and it offers nested data types such as tuples, bags and maps that are absent from MapReduce.

2.5.2. Architecture of Apache Pig: The language used to analyze data in Hadoop with Apache Pig is called Pig Latin. Pig Latin is a high-level language for writing machine-independent data-processing programs, and it provides a rich set of data types and operators for implementing different data operations. To carry out a particular task, programmers write a Pig script in the Pig Latin language and execute it through any of the execution mechanisms (the Grunt shell, UDFs, or embedded execution). After submission, these scripts go through a series of transformations applied by the Pig framework to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, which makes the programmer's job very simple. The architecture of Apache Pig is described in Figure 7, and a sketch of the embedded execution method follows the figure.

Figure 7. Architecture of Apache Pig
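The following Java sketch illustrates the embedded execution method mentioned above: Pig Latin statements are registered from Java through Pig's PigServer class, and Pig compiles them into MapReduce jobs when the result is stored. The log layout, field names, paths and the choice of local mode are illustrative assumptions, not details of the reviewed system.

// Embedded Pig: group a weblog by requested URL and count the hits.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // use ExecType.MAPREDUCE on a cluster

        // Load the weblog, group it by requested URL and count the hits (assumed field layout).
        pig.registerQuery("logs = LOAD '/weblogs/raw/access.log' USING PigStorage(' ') "
                + "AS (ip:chararray, ts:chararray, url:chararray);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;");

        // Materialize the result; Pig turns the registered statements into MapReduce jobs here.
        pig.store("hits", "/weblogs/pig-output/url-hits");

        pig.shutdown();
    }
}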

2.5.3. Components of Apache Pig: As shown in Figure 7, the Apache Pig framework consists of several components. The major ones are:

Parser: Initially, the Pig scripts are handled by the parser, which checks the syntax of the script and performs type checking and other checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators; the logical operators of the script are represented as the nodes and the data flows as the edges.

Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the required results.

2.5.4. Applications of Apache Pig: Apache Pig is used by data scientists for tasks involving ad-hoc processing and quick prototyping. Apache Pig is used to carry out tasks such as:

a) Processing of large data sources such as web logs.

b) Processing data for search platforms.

c) Processing time-sensitive data loads.

3. Results and Comparisons

In this section we list and compare the functions, advantages and limitations of the Hadoop Distributed File System (HDFS) in Table 1, the MapReduce software framework in Table 2 and the Pig programming language in Table 3. We also compare Apache Pig with SQL in Table 4 and Apache Pig with MapReduce in Table 5.

Table 1. Functions, Advantages and Limitations of the Hadoop Distributed File System

Functions:

a) Maintains transaction logs to monitor each operation and to carry out efficient auditing and data recovery if something goes wrong [9].

b) Uses an efficient error-detection technique, such as checksum validation, for verifying the contents of a file [9].

c) Maintains replicated copies of data blocks to avert corruption of a file due to server failure [9].

d) Contains a re-balancer service which balances the data load across the Data Nodes [9].

e) Allows the replication factor to be lowered or raised through configuration settings, which permits a node to fail without data loss.

f) Uses rack awareness in block placement for fault tolerance, by placing one block replica on a distinct rack.

g) Gives adaptability and is very easy to extend.

Advantages:

a) Appropriate for distributed storage and distributed processing.

b) Hadoop provides a command interface through which it is possible to interact with HDFS.

c) The built-in servers of the Name Node and Data Node let users inspect the cluster status easily.

d) Gives streaming access to file system data.

e) Provides file permissions and authentication.

f) Permits data replication, which makes the data resilient in the event of an unforeseen disaster or corruption of the contents of any data location [9].

g) Supports various operating systems and gives the same performance on all of them [9].

Limitations:

a) Applications that require low-latency access to data, in the tens-of-milliseconds range, do not work well with HDFS [10].

b) HDFS files have no provision for multiple writers or for modifications at arbitrary offsets within a file [10].

Table 2. Functions, Advantages and Limitations of the MapReduce Software Framework

Functions:

a) Helps to process and analyze unstructured and semi-structured data obtained from various sources more efficiently [9].

b) Helps to process data stored in a file system computationally, without first loading it into a database [9].

c) Offers map and reduce operations that run side by side across a collection of worker nodes [9].

d) Works on a master/worker model in which the master process commands and conducts the whole activity, such as collecting, segregating and delegating the data across the various workers [9].

Advantages:

a) MapReduce can handle tasks such as parsing text, crunching through web logs and reading from large raw data sources more efficiently than relational databases [9].

b) MapReduce can discard surplus data while loading it into the database [9].

c) MapReduce can be used as an Extract, Load and Transform (ETL) tool that reads a collection of data sources, carries out formatting or reshuffling steps, and delivers the generated outputs to a destination data source [9].

d) MapReduce processes a Big Data source and summarizes it in a more meaningful way [9].

Limitations:

a) If the calculation of a value relies on previously calculated values, MapReduce cannot be used.

b) Not being a database, MapReduce lacks database features such as indexing, security, and query or process optimization [9].

c) MapReduce requires the data type generated in each process to be described, which makes everything, including data structures, custom-coded [9].

d) MapReduce considers each job to be an entity in itself; it has no knowledge of the other processes being executed at the same time [9].


Table 3. Functions, Advantages and Limitations of the Pig Programming Language

Functions:

a) Gives a high-level abstraction for executing MapReduce.

b) Gives programmers independence from the details of the Hadoop or MapReduce Java APIs.

c) Allows developers to focus on their application or data flow.

d) Contains nested data types.

e) Can operate without a schema, or with a flexible schema.

f) Provides built-in as well as user-defined functions (UDFs).

g) Gives support for external binaries.

h) Supports a procedural, step-by-step implementation.

i) Loads the data before the mapper runs; for this, Pig provides a split operation by which the input can be divided into parts that are processed individually by each mapper [10].

j) Is designed for batch processing of data [10].

k) Pig Latin functions as a data-flow programming language.

l) Gives commands to interact with the Hadoop file systems, which are extremely useful for moving data around before and after processing; this is performed using Pig, MapReduce and utility commands.

Advantages:

a) Pig is similar to SQL, which makes it seem friendly to developers who already know SQL; this enables rapid development and reduces the time needed for it.

b) Pig looks after the fundamentals of connecting jobs and the data flow.

c) Pig contains multiple basic operations on data such as JOIN, GROUP and FILTER.

d) Pig can operate on any data type.

e) Pig gives flexibility of data format with the help of UDFs.

f) Pig is extensible by users.

g) Pig enables faster development as well as faster runtime.

h) Pig enables parallelism, which makes it well suited to handling extremely large data sets [9].

i) Pig is used as an Extract, Load and Transform (ETL) tool for Hadoop, which makes Hadoop more accessible and user-friendly for non-technical people [9].

j) The high level of abstraction of Apache Pig results in a higher degree of execution independence.

k) Pig can process terabytes of data very easily, by issuing a half-dozen lines of Pig Latin from the console.

l) With Pig Latin, programmers can run MapReduce tasks easily without having to type complex Java code. Apache Pig uses a multi-query approach, which shortens the code: an operation that would require 200 lines of code (LoC) in Java can be accomplished with as few as 10 LoC in Apache Pig, reducing development time by almost 16 times.

Limitations:

a) Pig is not appropriate for all data-processing jobs [10].

b) When a query touches only a small amount of data within a huge data set, Pig does not perform well, because it is designed to scan the entire data set, or at least large portions of it [10].

c) Features that support low-latency online queries in RDBMSs, such as transactions and indexes, are absent from Apache Pig [10].

d) Pig does not support arbitrary reads or queries in the range of tens of milliseconds [10].

e) Pig does not support arbitrary writes that modify small portions of data; every write is a bulk streaming write [10].

f) Pig does not support low-latency queries [10].


Table 4. Comparison between Apache Pig and SQL

Apache Pig | SQL
Pig Latin is a procedural language. | SQL is a declarative language.
The schema is optional in Apache Pig; data can be stored without designing a schema (fields are referenced positionally as $0, $1, etc.). | A schema is compulsory in SQL.
Apache Pig provides a nested relational data model. | SQL uses a flat relational data model.
Apache Pig offers limited opportunity for query optimization. | SQL offers far greater opportunity for query optimization.

Table 5. Comparison between Apache Pig and MapReduce

Apache Pig | MapReduce
Apache Pig is a data-flow language. | MapReduce is a data-processing paradigm.
Apache Pig is a high-level language. | MapReduce is low level and inflexible.
Executing a join operation in Apache Pig is extremely easy. | Executing a join operation between data sets in MapReduce is very difficult.
Any novice programmer with a basic knowledge of SQL can work efficiently with Apache Pig. | Experience in Java is mandatory to implement MapReduce.
Apache Pig uses a multi-query approach, which greatly reduces the length of the code. | MapReduce needs almost 20 times more lines of code to implement a similar task.
There is no need for a separate compilation step: on execution, every Apache Pig operator is converted internally into a MapReduce job. | MapReduce jobs have a long compilation process.

4. Conclusion

The execution of a weblog analysis system based on Hadoop indicates that the distributed architecture of Hadoop makes it appropriate for handling large files [1]. Hadoop gives better performance than shell commands when the file size exceeds 512 MB [1]. Being a high-level programming language, Pig provides many functions for data analysis that can fulfill the varying requirements of system analysis [1].

5. Future Scope

Cloud computing research is progressing extremely fast towards combining cloud computing with the Hadoop architecture, comprising the Hadoop Distributed File System (HDFS), the Hadoop MapReduce software framework and the Pig Latin language, in order to expose new trends and make the collection, storage and analysis of the huge weblogs produced daily by a huge number of Internet users more productive and quicker. We hope to see better requirement analysis of logs and better performance from such a Hadoop-based weblog analysis system, providing a reference for performance tuning that enables us to predict the future trend of a system.


References

[1] C. H. Wang, C. T. Tsai, C. C. Fan and S. M. Yuan, "A Hadoop-based Weblog Analysis System", International Conference on Ubi-Media Computing and Workshops (UMEDIA), Ulaanbaatar, Mongolia, (2014) July 12-14, pp. 72-77.
[2] S. Narkhede and T. Baraskar, "HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce", International Journal of UbiComp (IJU), vol. 4, no. 3, (2013) July, pp. 41-51.
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing", SIGMOD 2008, Vancouver, Canada, (2008) June 9-12, pp. 1099-1110.
[4] Y. Q. Wei, G. G. Zhou, D. Xu and Y. Chen, "Design of the Web Log Analysis System Based on Hadoop", Advanced Materials Research, vol. 926-930, (2014) May, pp. 2474-2477.
[5] http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/
[6] A. Akdogan, U. Demiryurek, F. Banaei-Kashani and C. Shahabi, "Voronoi-Based Geospatial Query Processing with MapReduce", International Conference on Cloud Computing Technology and Science (CloudCom), Indianapolis, IN, (2010) Nov. 30 - Dec. 3, pp. 9-16.
[7] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", ACM Symposium on Operating Systems Design & Implementation (OSDI), vol. 6, (2004) December, p. 10.
[8] Z. Ma and L. Gu, "The Limitation of MapReduce: A Probing Case and a Lightweight Solution", International Conference on Cloud Computing, GRIDs, and Virtualization, (2010), pp. 68-73.
[9] DT Editorial Services, Big Data (Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization) Black Book, Dreamtech Press, New Delhi, (2015).
[10] T. White, Hadoop: The Definitive Guide, Third Edition, O'Reilly Media, Inc., United States of America, (2012).