3 approaches to big data analysis with Apache Hadoop

The Apache™ Hadoop® ecosystem provides a rich source of utilities that are key to helping enterprises unlock vital insights from large data sets. Discover how three powerful data analysis tools match up.

By Dave Jaffe
Log files from web servers represent a
treasure trove of data that enterprises
can mine to gain a deep understanding
of customer shopping habits, social
media use, web advertisement effectiveness and
other metrics that inform business decisions.
Each click from a web page can create on the
order of 100 bytes of data in a typical website
log. Consequently, large websites handling
millions of simultaneous visitors can generate
hundreds of gigabytes or even terabytes of
logs per day. Ferreting out nuggets of valuable
information from this mass of data can require
very sophisticated algorithms.
To analyze big data, many organizations turn
to open-source utilities found in the Apache
Hadoop ecosystem. The choice of a particular
tool depends on the needs of the analysis, the skill
set of the data analyst, and the trade-off between
development time and execution time.
Three commonly used tools for analyzing data
resident in Apache HDFS™ (Hadoop Distributed
File System) are the Hadoop MapReduce
framework, Apache Hive™ data warehousing
software and the Apache Pig™ platform.
MapReduce requires a computer program —
often written in the Oracle® Java® programming
language — to read, analyze and output data.
Hive provides a SQL-like front end well suited
for analysts with a database background,
who view data in terms of tables and joins.
And the Pig platform includes a high-level
language for data processing that enables the
analyst to exploit the parallelism inherent in a
Hadoop cluster.
Understanding website visitors
through log file analysis
To compare the performance of the three
tools, Dell engineers created a program for
each tool that tackled a simple log file analysis
task: measuring the amount of traffic coming
to the website by country of origin on an
hour-by-hour basis during an average day.
(For more information, see the sidebar,
“Configuration details.”)
The test analyzed files in the standard Apache HTTP Server log format (see figure). The first component of the log file is the remote IP address, which the programs used to determine the host country. The programs parsed only the first two octets of the IP address and looked them up in a table derived from GeoLite data created by MaxMind. The table, contained in a space-separated file, all_classbs.txt, listed the Class B addresses used exclusively by a single country along with the country code. The hour of the visit can be extracted from the time stamp, which is the second component of the log file.

Components in a standard Apache web log file:

172.16.3.1 - - [27/Jun/2012:17:48:34 -0500] "GET /favicon.ico HTTP/1.1" 404 298 "http://110.240.0.17" "Mozilla/5.0 …"

In order, the components are the remote host, the date/time stamp, the request line, the status / size / referrer fields and the user agent.
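For reference, each line of a Class B-to-country table such as all_classbs.txt pairs a two-octet prefix with a country code, separated by a space. The entries below illustrate only that layout; they are hypothetical values, not taken from the actual MaxMind-derived file:

    58.16 CN
    65.0 US
    81.192 MA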
The data used in the test was generated by a
MapReduce program, the GeoWeb Apache Weblog
Generator.1 The program was designed to
produce realistic sequential Apache web
logs for a specified month, day, year and
number of clicks per day. Remote hosts
were distributed geographically among
the top 20 internet-using countries2 and
temporally so that each region was most
active during its local evening hours,
simulating a consumer or social website.
Since the synthetic web logs created for
the test represent just the top 20 internet-
using countries, the output of each program
consisted of 480 keys (20 countries over
24 hours), each associated with a value
representing the total number of hits from
that country during that hour.
MapReduce: Parallel processing
for large data sets
The MapReduce framework provides a
flexible, resilient and efficient mechanism
for distributed computing over large
server clusters. MapReduce coordinates
distributed servers to run various tasks in
parallel: map tasks that read data stored in
HDFS and emit key-value pairs; combiner
tasks that aggregate the values for each
key being emitted by a mapper; and
reducer tasks that process the values for
each key. Writing a MapReduce program
is a direct way to exploit the capabilities
of this framework for data manipulation
and analysis.
Designing the MapReduce program to
perform the geographical web analysis was
fairly straightforward.3 The mapper read log
files from HDFS one line at a time and parsed
the first two octets of the remote IP address
as well as the hour of web access. It then
looked up the country corresponding to that
IP address from a table generated from the
all_classbs.txt file and emitted a key — with
a value of 1 — comprising the country code
and hour. The combiner and reducer added
all the values per key and wrote 24 keys per
detected country to HDFS, each with a value
corresponding to the total number of hits
coming from that country in that hour across
the whole set of log files.
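The complete MapReduce code is posted on GitHub (see footnote 3). The following is only a minimal sketch of the mapper logic, using hypothetical helper and field names, to show the shape of the approach; the published GeoWebMapper.java differs in its details:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative sketch only; not the published GeoWebMapper.java
    public class GeoWebMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        // Class B prefix ("172.16") to country code, populated from
        // all_classbs.txt in setup() (omitted here)
        private Map<String, String> classBToCountry = new HashMap<String, String>();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // The remote IP address is the first space-separated field
            String ip = line.split(" ")[0];
            int secondDot = ip.indexOf('.', ip.indexOf('.') + 1);
            if (secondDot < 0) return;  // malformed entry; skip
            String classB = ip.substring(0, secondDot);
            // The first colon in the line separates the date from the hour,
            // as in [27/Jun/2012:17:48:34 -0500]
            int colon = line.indexOf(':');
            if (colon < 0 || colon + 3 > line.length()) return;
            String hour = line.substring(colon + 1, colon + 3);
            String country = classBToCountry.get(classB);
            if (country != null) {
                // Emit a composite country-hour key with a value of 1
                context.write(new Text(country + " " + hour), ONE);
            }
        }
    }

The combiner and reducer (SumReducer.java in the published code) then simply add the 1s for each country-hour key.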
Hive: Data warehouse infrastructure
for ad hoc queries
The Apache Hive tool projects structure
onto data stored in HDFS and also
provides a SQL interface, HiveQL (HQL), to
query that data. Hive creates a query plan that
implements HQL in a series of MapReduce
programs, generates the code for these
programs and then executes the code.
The Hive program first defined the HDFS
data in terms of SQL tables.4 Both the web logs
and the mapping information contained in
all_classbs.txt were turned into HQL tables.
Once the data tables were defined, the program
read the web log data and parsed it using
a serializer-deserializer provided by Hive,
RegexSerDe. The results were grouped by
country code and hour, and the count of each
combination was generated. Additional formatting
was performed so that the output would
resemble that of the MapReduce program.
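The complete Hive program, geoweb.q, is posted on GitHub (see footnote 4). A minimal sketch of the approach, in which the table names, HDFS paths and regular expression are illustrative stand-ins rather than the published ones, might look like this:

    -- Parse each log line into a Class B prefix and an hour using RegexSerDe
    CREATE EXTERNAL TABLE weblogs (classb STRING, hour STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "^(\\d+\\.\\d+)\\.\\d+\\.\\d+ \\S+ \\S+ \\[[^:]+:(\\d{2}).*"
    )
    LOCATION '/user/test/weblogs';

    -- The space-separated Class B-to-country mapping from all_classbs.txt
    CREATE EXTERNAL TABLE classbs (classb STRING, country STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION '/user/test/classbs';

    -- Join on the Class B prefix and count hits per country per hour
    SELECT c.country, w.hour, COUNT(*) AS hits
    FROM weblogs w JOIN classbs c ON (w.classb = c.classb)
    GROUP BY c.country, w.hour;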
Pig: High-level data flow framework
for parallel computation
Apache Pig provides a data flow language, Pig
Latin, that enables the user to specify reads,
joins and other computations without the need
to write a MapReduce program. Like Hive, Pig
generates a sequence of MapReduce programs
to implement the data analysis steps.
The Pig program loaded the web logs and
Class B IP address data from all_classbs.txt into
two Pig data bags, or relations.5 Then the program
parsed the web logs to extract the first two octets
of the IP address and the hour from the time
stamp. These two items formed a tuple that was saved in a data bag and then joined with the Class B-to-country information; the joined result, stored in another data bag, was then
grouped by the (country code, hour) tuple, and the number of entries in each group was counted. The result
was ordered by country code and hour and then
stored back in HDFS.
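The complete Pig program, geoweb.pig, is posted on GitHub (see footnote 5). A minimal sketch of the data flow, with illustrative aliases, paths and regular expressions, might look like the following; note the PARALLEL hints on the JOIN and GROUP steps, discussed below:

    -- Load the raw logs and the space-separated Class B-to-country table
    logs    = LOAD '/user/test/weblogs' USING TextLoader() AS (line:chararray);
    classbs = LOAD '/user/test/all_classbs.txt' USING PigStorage(' ')
              AS (classb:chararray, country:chararray);

    -- Extract the first two octets of the remote IP and the hour of access
    parsed  = FOREACH logs GENERATE
                REGEX_EXTRACT(line, '^(\\d+\\.\\d+)\\.', 1) AS classb,
                REGEX_EXTRACT(line, '\\[[^:]+:(\\d{2})', 1) AS hour;

    -- Join with the country table, then count hits per (country, hour)
    joined  = JOIN parsed BY classb, classbs BY classb PARALLEL 20;
    grouped = GROUP joined BY (classbs::country, parsed::hour) PARALLEL 20;
    counts  = FOREACH grouped GENERATE FLATTEN(group) AS (country, hour),
                COUNT(joined) AS hits;
    sorted  = ORDER counts BY country, hour;
    STORE sorted INTO '/user/test/geoweb_out';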
Pig compiled the data flow into MapReduce, resulting in a multi-pass MapReduce program. For each of the key steps in the flow (JOIN, GROUP and ORDER), a PARALLEL option may be specified as a hint that determines the number of reduce tasks deployed for that step.
Finally, the Dell team formatted the output to yield a result identical to that obtained from the MapReduce and Hive programs.

Configuration details

In October 2013, Dell engineers at the Dell Solution Center in Round Rock, Texas, tested programs written using MapReduce from Apache Hadoop 1.0.3, Apache Hive 0.9.0 and Apache Pig 0.11.1. The programs ran on a cluster that was based on the Intel® Distribution for Apache Hadoop version 2.4.1.*

The cluster's name nodes and edge node ran on Dell PowerEdge R720 servers, each with two eight-core Intel® Xeon® processors E5-2650 at 2.0 GHz, 128 GB of memory and six 600 GB Serial Attached SCSI (SAS) disks in a RAID-10 configuration. The 20 data nodes were PowerEdge R720xd servers with two eight-core Intel Xeon processors E5-2650 at 2.0 GHz, 64 GB of memory and twenty-four 500 GB Serial ATA (SATA) disks, each in a RAID-0 configuration. The total raw disk space on the cluster was over 200 TB. The servers were connected through Dell Networking S60 switches and Gigabit Ethernet (GbE) network interface cards.

A set of 366 Apache web log files, one for each day of 2012, was created by the GeoWeb Apache Weblog Generator tool and stored in Hadoop Distributed File System (HDFS). Each day consisted of 11,900,000 log entries. The total size occupied by the log files was 1 TB. A second set of log files was created with 119,000,000 entries per day, for a total size of 10 TB.

* The implementation of the Intel Distribution on Dell PowerEdge servers, including generalized cluster configuration parameters, is described in the Dell white paper "Intel Distribution for Apache Hadoop On Dell PowerEdge Servers," available at qrs.ly/xg3tmsg.
Comparison of program performance
The Dell team ran the MapReduce, Hive and Pig
programs sequentially against the 1 TB set of log
files. The total number of hits per country over the year and over the course of a 24-hour period was calculated; the resulting percentage distribution of traffic matched the input distribution, indicating that the parsing and processing of the IP address table and time stamp information worked properly for all three
programs. Because the same IP address data was
used to generate as well as analyze the remote
IP addresses, 100 percent of the log entries
successfully matched a country in this test. In a
real-world scenario, however, the percentage of
matches is expected to be lower.
The team then ran the three programs against the 10 TB set of log files and compared the performance with that on the 1 TB set (see table).

Tool      | Time to analyze 1 TB of web logs | Time relative to MapReduce | Time to analyze 10 TB of web logs | Time relative to MapReduce | Scaling, 10 TB vs. 1 TB workload
MapReduce | 7 min 40 sec                     | 1x                         | 69 min 54 sec                     | 1x                         | 9.12x
Hive      | 9 min 34 sec                     | 1.25x                      | 74 min 59 sec                     | 1.07x                      | 7.84x
Pig       | 20 min 57 sec                    | 2.73x                      | 183 min 54 sec                    | 2.63x                      | 8.78x

Comparison of program performance, in terms of elapsed time to analyze log data
Selecting the right tool for the job
To explore utilities available in the Hadoop
ecosystem, Dell engineers used the MapReduce,
Hive and Pig programs to analyze files in the
standard Apache HTTP Server log format. The
algorithms created by the Dell team can be
adapted easily to other log formats.
As might be expected, the MapReduce
program performed the best for both sets
of log files tested, because it is a single
program explicitly written for the MapReduce
framework. The Hive and Pig programs,
which generate multiple MapReduce
programs to accomplish the same task,
took longer to execute.
However, the performance difference was
less pronounced with the larger data set size,
indicating that the overhead of running multiple
batch jobs in Hive and Pig had less impact on
longer-running batch jobs. Moreover, all three
programs showed excellent scalability: the 10 TB data set took less than 10 times as long to analyze as the 1 TB data set.
These results demonstrated the trade-off
between development time and execution time.
Hive and Pig programs are usually quicker to
develop but take longer to run than MapReduce
programs, with less of a disadvantage for larger
workloads. In the end, enterprises can effectively
leverage all three approaches to harness the
potential of big data for making informed
business decisions.
1 Visit github.com/DaveJaffe/BigDataDemos to view more information and complete code for the GeoWeb Apache Weblog Generator tool.
2 Top 20 countries determined from 2011 Wikipedia data.
3 Visit github.com/DaveJaffe/BigDataDemos to view complete code listings of the MapReduce program, which comprises the GeoWeb.java driver, the GeoWebMapper.java mapper and the SumReducer.java combiner and reducer.
4 Visit github.com/DaveJaffe/BigDataDemos to view complete code for the geoweb.q Hive program.
5 Visit github.com/DaveJaffe/BigDataDemos to view complete code for the geoweb.pig Pig program.

Dive deeper

Download this white paper for an in-depth exploration of the geographical and temporal analysis of Apache web logs using MapReduce, Hive and Pig. The appendices include code listings for the programs used in the analysis.

qrs.ly/af3tmsi

Reprinted from Dell Power Solutions, 2014 Issue 1. Copyright © 2014 Dell Inc. All rights reserved. Dell and PowerEdge are trademarks of Dell Inc.
Learn more
Apache Hadoop:
hadoop.apache.org
Apache Hive:
hive.apache.org
Apache Pig:
pig.apache.org
Author
Dave Jaffe is a solution architect for
Dell Solution Centers.