Weblog Analysis on AWS EMR
Weblog Analysis Case Study
This case study processes a large volume of web logs to identify HTTP error codes returned from the server, which can help predict potential attacks. The HTTP return code can be found in the second-to-last column of each log line. We are going to extract the rows with an HTTP return code greater than 300.
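The core of the mapper is just field extraction and a comparison. The sketch below illustrates the idea outside of Hadoop; the class and method names are illustrative assumptions, not taken from the repository, which contains the actual MapReduce implementation.

```java
// Minimal sketch of the filtering logic, assuming Common Log Format input
// where the HTTP status code is the second-to-last whitespace-separated field.
public class WeblogFilter {

    // Extract the HTTP status code from one log line.
    static int statusCode(String logLine) {
        String[] fields = logLine.trim().split("\\s+");
        return Integer.parseInt(fields[fields.length - 2]);
    }

    // Keep only entries whose status code is greater than 300.
    static boolean isInvalidAccess(String logLine) {
        return statusCode(logLine) > 300;
    }

    public static void main(String[] args) {
        String line = "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "
                + "\"GET /images/NASA-logosmall.gif HTTP/1.0\" 304 0";
        System.out.println(statusCode(line));      // 304
        System.out.println(isInvalidAccess(line)); // true
    }
}
```

In the real job, a mapper would apply this predicate to each input line and emit only the matching records.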
Source Code
The source code can be downloaded from my GitHub repository:
https://github.com/yapweiyih/webloganalysis
Data Source
Input folders: web log and JAR file
Output Folders:
Cluster Configuration
Hardware Configuration
Specify the instance types and the number of instances for the cluster
Step Configuration
Specify the class name, the input file to be processed, and the output log directory
Create Cluster
Click ‘Create Cluster’ and wait for the setup to complete.
EC2 Dashboard
The cluster you created can also be viewed in your EC2 dashboard.
You can verify here whether the cluster has been terminated, to avoid unnecessary billing charges.
Output log
Three rows of output showing the web log details together with the extracted return code of 304:
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0	304
129.94.144.152 - - [01/Jul/1995:00:00:17 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0	304
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0	304
Depending on business requirements, the output can also be formulated to count the number of invalid accesses per domain name:
burger.letters.com 2
129.94.144.152 1
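In MapReduce terms, this variant has the mapper emit (domain, 1) for every invalid access and the reducer sum the counts per domain. A minimal in-memory sketch of that aggregation (class and method names are illustrative assumptions, not from the repository):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the per-domain count, assuming the first field of each log line
// is the domain and the second-to-last field is the HTTP status code.
public class InvalidAccessCounter {

    static Map<String, Integer> countInvalid(String[] logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            int status = Integer.parseInt(fields[fields.length - 2]);
            if (status > 300) {                       // invalid access
                counts.merge(fields[0], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logosmall.gif HTTP/1.0\" 304 0",
            "129.94.144.152 - - [01/Jul/1995:00:00:17 -0400] \"GET /images/ksclogo-medium.gif HTTP/1.0\" 304 0",
            "burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdown/liftoff.html HTTP/1.0\" 304 0"
        };
        countInvalid(lines).forEach((k, v) -> System.out.println(k + " " + v));
        // burger.letters.com 2
        // 129.94.144.152 1
    }
}
```

On a real cluster, the `counts.merge(...)` step is what the reducer performs across all mapper outputs for a given domain key.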
Debugging Log
The debugging log can be found on the AWS EMR console as well.
Alternative Dev Tools
If you do not want to spend money on AWS, you can also deploy your JAR and input data to the Hortonworks or Cloudera sandbox.
Hortonworks Distribution: http://hortonworks.com/products/hortonworks-sandbox/#install
Cloudera Distribution: http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html
Data Workflow Design
Assuming data from different sources only arrives at certain times, you can use the AWS CLI and schedule your tasks with a Linux cron job. Alternatively, you can make use of AWS Data Pipeline; however, it is currently not available in the Singapore region.
[Flowchart: data from Network, Social, and Sensor sources feeds the pipeline]
Start → Input file exists? → Start EC2 instance → Move file to EMR input directory → Process input files using EMR → End
AWS Data Pipeline
Using an AWS Data Pipeline workflow to achieve the same objective: