3 approaches to big data analysis with Apache Hadoop

The Apache™ Hadoop® ecosystem provides a rich source of utilities that are key to helping enterprises unlock vital insights from large data sets. Discover how three powerful data analysis tools match up.

By Dave Jaffe
Log files from web servers represent a
treasure trove of data that enterprises
can mine to gain a deep understanding
of customer shopping habits, social
media use, web advertisement effectiveness and
other metrics that inform business decisions.
Each click from a web page can create on the
order of 100 bytes of data in a typical website
log. Consequently, large websites handling
millions of simultaneous visitors can generate
hundreds of gigabytes or even terabytes of
logs per day. Ferreting out nuggets of valuable
information from this mass of data can require
very sophisticated algorithms.
To analyze big data, many organizations turn
to open-source utilities found in the Apache
Hadoop ecosystem. The choice of a particular
tool depends on the needs of the analysis, the skill
set of the data analyst, and the trade-off between
development time and execution time.
Three commonly used tools for analyzing data
resident in Apache HDFS™ (Hadoop Distributed
File System) are the Hadoop MapReduce
framework, Apache Hive™ data warehousing
software and the Apache Pig™ platform.
MapReduce requires a computer program —
often written in the Oracle® Java® programming
language — to read, analyze and output data.
Hive provides a SQL-like front end well suited
for analysts with a database background,
who view data in terms of tables and joins.
And the Pig platform includes a high-level
language for data processing that enables the
analyst to exploit the parallelism inherent in a
Hadoop cluster.
Understanding website visitors
through log file analysis
To compare the performance of the three
tools, Dell engineers created a program for
each tool that tackled a simple log file analysis
task: measuring the amount of traffic coming
to the website by country of origin on an
hour-by-hour basis during an average day.
(For more information, see the sidebar,
“Configuration details.”)
The test analyzed files in the standard Apache HTTP Server log format (see figure). The first component of the log file is the remote IP address, which the programs used to determine the host country. The programs parsed only the first two octets of the IP address and looked them up in a table derived from GeoLite data created by MaxMind. The table, contained in a space-separated file, all_classbs.txt, listed the Class B addresses used exclusively by a single country along with the country code. The hour of the visit can be extracted from the time stamp, which is the second component of the log file.

Components in a standard Apache web log file:

172.16.3.1 - - [27/Jun/2012:17:48:34 -0500] "GET /favicon.ico HTTP/1.1" 404 298 "http://110.240.0.17" "Mozilla/5.0 …"

In order, the components are the remote host, the date/time stamp, the request line, the status / size / referrer fields and the user agent.
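For reference, each line of a Class B-to-country table such as all_classbs.txt pairs a two-octet prefix with a country code, separated by a space. The entries below illustrate only that layout; they are hypothetical values, not taken from the actual MaxMind-derived file:

    58.16 CN
    65.0 US
    81.192 MA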
The data used in the test was generated by a
MapReduce program, the GeoWeb Apache Weblog
Generator.1 The program was designed to
produce realistic sequential Apache web
logs for a specified month, day, year and
number of clicks per day. Remote hosts
were distributed geographically among
the top 20 internet-using countries2 and
temporally so that each region was most
active during its local evening hours,
simulating a consumer or social website.
Since the synthetic web logs created for
the test represent just the top 20 internet-
using countries, the output of each program
consisted of 480 keys (20 countries over
24 hours), each associated with a value
representing the total number of hits from
that country during that hour.
MapReduce: Parallel processing
for large data sets
The MapReduce framework provides a
flexible, resilient and efficient mechanism
for distributed computing over large
server clusters. MapReduce coordinates
distributed servers to run various tasks in
parallel: map tasks that read data stored in
HDFS and emit key-value pairs; combiner
tasks that aggregate the values for each
key being emitted by a mapper; and
reducer tasks that process the values for
each key. Writing a MapReduce program
is a direct way to exploit the capabilities
of this framework for data manipulation
and analysis.
Designing the MapReduce program to
perform the geographical web analysis was
fairly straightforward.3 The mapper read log
files from HDFS one line at a time and parsed
the first two octets of the remote IP address
as well as the hour of web access. It then
looked up the country corresponding to that
IP address from a table generated from the
all_classbs.txt file and emitted a key — with
a value of 1 — comprising the country code
and hour. The combiner and reducer added
all the values per key and wrote 24 keys per
detected country to HDFS, each with a value
corresponding to the total number of hits
coming from that country in that hour across
the whole set of log files.
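The complete MapReduce code is posted on GitHub (see footnote 3). The following is only a minimal sketch of the mapper logic, using hypothetical helper and field names, to show the shape of the approach; the published GeoWebMapper.java differs in its details:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative sketch only; not the published GeoWebMapper.java
    public class GeoWebMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        // Class B prefix ("172.16") to country code, populated from
        // all_classbs.txt in setup() (omitted here)
        private Map<String, String> classBToCountry = new HashMap<String, String>();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // The remote IP address is the first space-separated field
            String ip = line.split(" ")[0];
            int secondDot = ip.indexOf('.', ip.indexOf('.') + 1);
            if (secondDot < 0) return;  // malformed entry; skip
            String classB = ip.substring(0, secondDot);
            // The first colon in the line separates the date from the hour,
            // as in [27/Jun/2012:17:48:34 -0500]
            int colon = line.indexOf(':');
            if (colon < 0 || colon + 3 > line.length()) return;
            String hour = line.substring(colon + 1, colon + 3);
            String country = classBToCountry.get(classB);
            if (country != null) {
                // Emit a composite country-hour key with a value of 1
                context.write(new Text(country + " " + hour), ONE);
            }
        }
    }

The combiner and reducer (SumReducer.java in the published code) then simply add the 1s for each country-hour key.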
Hive: Data warehouse infrastructure
for ad hoc queries
The Apache Hive tool projects structure
onto data stored in HDFS and also
provides a SQL interface, HiveQL (HQL), to
query that data. Hive creates a query plan that
implements HQL in a series of MapReduce
programs, generates the code for these
programs and then executes the code.
The Hive program first defined the HDFS
data in terms of SQL tables.4 Both the web logs
and the mapping information contained in
all_classbs.txt were turned into HQL tables.
Once the data tables were defined, the program
read the web log data and parsed it using
a serializer-deserializer provided by Hive,
RegexSerDe. The results were grouped by
country code and hour, and the count of each
combination was generated. Additional formatting
was performed so that the output would
resemble that of the MapReduce program.
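The complete Hive program, geoweb.q, is posted on GitHub (see footnote 4). A minimal sketch of the approach, in which the table names, HDFS paths and regular expression are illustrative stand-ins rather than the published ones, might look like this:

    -- Parse each log line into a Class B prefix and an hour using RegexSerDe
    CREATE EXTERNAL TABLE weblogs (classb STRING, hour STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "^(\\d+\\.\\d+)\\.\\d+\\.\\d+ \\S+ \\S+ \\[[^:]+:(\\d{2}).*"
    )
    LOCATION '/user/test/weblogs';

    -- The space-separated Class B-to-country mapping from all_classbs.txt
    CREATE EXTERNAL TABLE classbs (classb STRING, country STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION '/user/test/classbs';

    -- Join on the Class B prefix and count hits per country per hour
    SELECT c.country, w.hour, COUNT(*) AS hits
    FROM weblogs w JOIN classbs c ON (w.classb = c.classb)
    GROUP BY c.country, w.hour;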
Pig: High-level data flow framework
for parallel computation
Apache Pig provides a data flow language, Pig
Latin, that enables the user to specify reads,
joins and other computations without the need
to write a MapReduce program. Like Hive, Pig
generates a sequence of MapReduce programs
to implement the data analysis steps.
The Pig program loaded the web logs and
Class B IP address data from all_classbs.txt into
two Pig data bags, or relations.5 Then the program
parsed the web logs to extract the first two octets
of the IP address and the hour from the time
stamp. These two items formed a tuple that was saved in a data bag and then joined with the Class B-to-country information; the joined result, stored in another data bag, was then
grouped by the (country code, hour) tuple, and the number of entries in each group was counted. The result
was ordered by country code and hour and then
stored back in HDFS.
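The complete Pig program, geoweb.pig, is posted on GitHub (see footnote 5). A minimal sketch of the data flow, with illustrative aliases, paths and regular expressions, might look like the following; note the PARALLEL hints on the JOIN and GROUP steps, discussed below:

    -- Load the raw logs and the space-separated Class B-to-country table
    logs    = LOAD '/user/test/weblogs' USING TextLoader() AS (line:chararray);
    classbs = LOAD '/user/test/all_classbs.txt' USING PigStorage(' ')
              AS (classb:chararray, country:chararray);

    -- Extract the first two octets of the remote IP and the hour of access
    parsed  = FOREACH logs GENERATE
                REGEX_EXTRACT(line, '^(\\d+\\.\\d+)\\.', 1) AS classb,
                REGEX_EXTRACT(line, '\\[[^:]+:(\\d{2})', 1) AS hour;

    -- Join with the country table, then count hits per (country, hour)
    joined  = JOIN parsed BY classb, classbs BY classb PARALLEL 20;
    grouped = GROUP joined BY (classbs::country, parsed::hour) PARALLEL 20;
    counts  = FOREACH grouped GENERATE FLATTEN(group) AS (country, hour),
                COUNT(joined) AS hits;
    sorted  = ORDER counts BY country, hour;
    STORE sorted INTO '/user/test/geoweb_out';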
Pig compiled the data flow into MapReduce, resulting in a multi-pass MapReduce program. For each of the key steps in the flow (JOIN, GROUP and ORDER), a PARALLEL option may be specified as a hint that determines the number of reduce tasks deployed for that step.
Finally, the Dell team formatted the output to yield a result identical to that obtained from the MapReduce and Hive programs.

Configuration details

In October 2013, Dell engineers at the Dell Solution Center in Round Rock, Texas, tested programs written using MapReduce from Apache Hadoop 1.0.3, Apache Hive 0.9.0 and Apache Pig 0.11.1. The programs ran on a cluster that was based on the Intel® Distribution for Apache Hadoop version 2.4.1.*

The cluster's name nodes and edge node ran on Dell PowerEdge R720 servers, each with two eight-core Intel® Xeon® processors E5-2650 at 2.0 GHz, 128 GB of memory and six 600 GB Serial Attached SCSI (SAS) disks in a RAID-10 configuration. The 20 data nodes were PowerEdge R720xd servers with two eight-core Intel Xeon processors E5-2650 at 2.0 GHz, 64 GB of memory and twenty-four 500 GB Serial ATA (SATA) disks, each in a RAID-0 configuration. The total raw disk space on the cluster was over 200 TB. The servers were connected through Dell Networking S60 switches and Gigabit Ethernet (GbE) network interface cards.

A set of 366 Apache web log files, one for each day of 2012, was created by the GeoWeb Apache Weblog Generator tool and stored in Hadoop Distributed File System (HDFS). Each day consisted of 11,900,000 log entries. The total size occupied by the log files was 1 TB. A second set of log files was created with 119,000,000 entries per day, for a total size of 10 TB.

* The implementation of the Intel Distribution on Dell PowerEdge servers, including generalized cluster configuration parameters, is described in the Dell white paper "Intel Distribution for Apache Hadoop On Dell PowerEdge Servers," available at qrs.ly/xg3tmsg.
Comparison of program performance
The Dell team ran the MapReduce, Hive and Pig
programs sequentially against the 1 TB set of log
files. The total number of hits per country over the year and over the course of a 24-hour period was calculated; the resulting percentage distribution of traffic matched the input distribution, indicating that the parsing and processing of the IP address table and time stamp information worked properly for all three
programs. Because the same IP address data was
used to generate as well as analyze the remote
IP addresses, 100 percent of the log entries
successfully matched a country in this test. In a
real-world scenario, however, the percentage of
matches is expected to be lower.
The team then ran the three programs against the 10 TB set of log files and compared the performance with that on the 1 TB set (see table).

Tool      | Time to analyze 1 TB of web logs | Time relative to MapReduce | Time to analyze 10 TB of web logs | Time relative to MapReduce | Scaling, 10 TB vs. 1 TB workload
MapReduce | 7 min 40 sec                     | 1x                         | 69 min 54 sec                     | 1x                         | 9.12x
Hive      | 9 min 34 sec                     | 1.25x                      | 74 min 59 sec                     | 1.07x                      | 7.84x
Pig       | 20 min 57 sec                    | 2.73x                      | 183 min 54 sec                    | 2.63x                      | 8.78x

Comparison of program performance, in terms of elapsed time to analyze log data
Selecting the right tool for the job
To explore utilities available in the Hadoop
ecosystem, Dell engineers used the MapReduce,
Hive and Pig programs to analyze files in the
standard Apache HTTP Server log format. The
algorithms created by the Dell team can be
adapted easily to other log formats.
As might be expected, the MapReduce
program performed the best for both sets
of log files tested, because it is a single
program explicitly written for the MapReduce
framework. The Hive and Pig programs,
which generate multiple MapReduce
programs to accomplish the same task,
took longer to execute.
However, the performance difference was
less pronounced with the larger data set size,
indicating that the overhead of running multiple
batch jobs in Hive and Pig had less impact on
longer-running batch jobs. Moreover, all three
programs showed excellent scalability: the 10 TB data set took less than 10 times as long to analyze as the 1 TB data set.
These results demonstrated the trade-off
between development time and execution time.
Hive and Pig programs are usually quicker to
develop but take longer to run than MapReduce
programs, with less of a disadvantage for larger
workloads. In the end, enterprises can effectively
leverage all three approaches to harness the
potential of big data for making informed
business decisions.
1 Visit github.com/DaveJaffe/BigDataDemos to view more information and complete code for the GeoWeb Apache Weblog Generator tool.
2 Top 20 countries determined from 2011 Wikipedia data.
3 Visit github.com/DaveJaffe/BigDataDemos to view complete code listings of the MapReduce program, which comprises the GeoWeb.java driver, the GeoWebMapper.java mapper and the SumReducer.java combiner and reducer.
4 Visit github.com/DaveJaffe/BigDataDemos to view complete code for the geoweb.q Hive program.
5 Visit github.com/DaveJaffe/BigDataDemos to view complete code for the geoweb.pig Pig program.

Dive deeper

Download this white paper for an in-depth exploration of the geographical and temporal analysis of Apache web logs using MapReduce, Hive and Pig. The appendices include code listings for the programs used in the analysis.

qrs.ly/af3tmsi

Reprinted from Dell Power Solutions, 2014 Issue 1. Copyright © 2014 Dell Inc. All rights reserved. Dell and PowerEdge are trademarks of Dell Inc.
Learn more
Apache Hadoop:
hadoop.apache.org
Apache Hive:
hive.apache.org
Apache Pig:
pig.apache.org
Author
Dave Jaffe is a solution architect for
Dell Solution Centers.