Weblog Analysis on AWS EMR

Upload: heavenode

Post on 17-Aug-2015


Page 1: Weblog Analysis

Weblog Analysis on AWS EMR

Page 2: Weblog Analysis

Weblog Analysis Case Study

This case study processes a large volume of web logs to identify HTTP error codes returned by the server, which can help predict potential attacks. The HTTP return code is the second-to-last column of each log line (highlighted in red in the original slide). We extract every row whose HTTP return code is greater than 300.
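The filtering rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the MapReduce job from the repository: the regular expression assumes the Common Log Format seen in the sample output, and the names `LOG_LINE`, `parse_status`, `filter_errors`, and `SAMPLE` are hypothetical. The second sample line (status 200) is an illustrative line that does not come from the slides.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [timestamp] "request" status bytes
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)$')


def parse_status(line):
    """Return the HTTP status code of a log line, or None if it does not parse."""
    match = LOG_LINE.match(line.strip())
    return int(match.group(4)) if match else None


def filter_errors(lines, threshold=300):
    """Yield (line, status) for lines whose status code is greater than threshold."""
    for line in lines:
        status = parse_status(line)
        if status is not None and status > threshold:
            yield line.strip(), status


SAMPLE = [
    'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] '
    '"GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
    # Illustrative line with a successful response; it should be skipped.
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
    '"GET /history/apollo/ HTTP/1.0" 200 6245',
]

if __name__ == "__main__":
    for line, status in filter_errors(SAMPLE):
        print(line, status, sep="\t")
```

Lines that do not match the assumed layout are skipped rather than treated as errors, which keeps a malformed record from stopping the run.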

Page 3: Weblog Analysis

Source Code

Source code can be downloaded from my GitHub repository below:

https://github.com/yapweiyih/webloganalysis

Page 4: Weblog Analysis

Data Source

Input Folders: Web log and JAR file

Output Folders:

Page 5: Weblog Analysis

Cluster Configuration

Page 6: Weblog Analysis

Hardware Configuration


Page 7: Weblog Analysis

Step Configuration

Specify the class name, the input file to be processed, and the output log directory.

Page 8: Weblog Analysis

Create Cluster

Click ‘Create Cluster’ and wait for the setup to complete.

Page 9: Weblog Analysis

EC2 Dashboard

The cluster you created can also be viewed in your EC2 dashboard.

You can check here that the cluster has been terminated, to avoid unnecessary charges.

Page 10: Weblog Analysis

Output log

Three rows of output, each showing the web log line followed by its return code of 304:

burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0	304

129.94.144.152 - - [01/Jul/1995:00:00:17 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0	304

burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0	304

Depending on business requirements, the output can also be formatted to count the number of invalid accesses per domain name:

burger.letters.com 2
129.94.144.152 1
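A minimal Python sketch of this per-host count, assuming whitespace-separated log lines with the host in the first column and the return code in the second-to-last column. The function name `count_by_host` and the variable `ROWS` are hypothetical, and this is not the code from the repository.

```python
from collections import Counter


def count_by_host(lines, threshold=300):
    """Count lines per host whose return code is greater than threshold."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # too short to hold a host and a return code
        host, status = fields[0], fields[-2]
        if status.isdigit() and int(status) > threshold:
            counts[host] += 1
    return counts


# The three log lines shown in the output above.
ROWS = [
    'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] '
    '"GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
    '129.94.144.152 - - [01/Jul/1995:00:00:17 -0400] '
    '"GET /images/ksclogo-medium.gif HTTP/1.0" 304 0',
    'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] '
    '"GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
]

if __name__ == "__main__":
    for host, count in count_by_host(ROWS).most_common():
        print(host, count)
```

In a MapReduce setting the same idea would typically map each qualifying line to a (host, 1) pair and sum the pairs in the reducer.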

Page 11: Weblog Analysis

Debugging Log

Debugging logs can be found on AWS EMR as well.

Page 12: Weblog Analysis

Alternative Dev Tools

If you do not want to spend money on AWS, you can also deploy your JAR and input data to the Hortonworks or Cloudera sandbox.

Hortonworks Distribution: http://hortonworks.com/products/hortonworks-sandbox/#install

Cloudera Distribution: http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html

Page 13: Weblog Analysis

Data Workflow Design

Assuming data from the different sources only arrives at certain times, you can use the AWS CLI and schedule your task with a Linux cron job. Alternatively, you can use AWS Data Pipeline; however, it is currently not available in the Singapore region.

[Workflow diagram. Data sources: Network, Social, Sensors. Nodes: Start; Input file exists?; Start EC2 Instance; Move file to EMR input directory; Process input files using EMR; End]
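The cron-driven workflow can be sketched as a small Python script that a cron job calls at the scheduled time. This is a sketch under stated assumptions, not the author's setup: it assumes the input lands in S3, that an EMR cluster is already running, and that a missing input file shows up as a non-zero exit status from `aws s3 ls`. All bucket paths, the cluster ID, the step name, and the class name are hypothetical placeholders.

```python
import subprocess


def build_commands(input_uri, emr_input_uri, cluster_id,
                   jar_uri, main_class, output_uri):
    """Build the AWS CLI calls for: check input, move input, submit the step."""
    check = ["aws", "s3", "ls", input_uri]
    move = ["aws", "s3", "mv", input_uri, emr_input_uri]
    step = (
        "Type=CUSTOM_JAR,Name=WeblogAnalysis,ActionOnFailure=CONTINUE,"
        "Jar={},MainClass={},Args=[{},{}]"
    ).format(jar_uri, main_class, emr_input_uri, output_uri)
    add_step = ["aws", "emr", "add-steps",
                "--cluster-id", cluster_id, "--steps", step]
    return [check, move, add_step]


def run_workflow(commands, run=subprocess.call):
    """Run the commands in order; return False if the input file is missing."""
    check, move, add_step = commands
    if run(check) != 0:
        return False  # no input yet; the next cron run will try again
    for command in (move, add_step):
        if run(command) != 0:
            raise RuntimeError("command failed: " + " ".join(command))
    return True


# Hypothetical placeholder values, for illustration only.
COMMANDS = build_commands(
    input_uri="s3://example-bucket/incoming/access.log",
    emr_input_uri="s3://example-bucket/emr/input/access.log",
    cluster_id="j-EXAMPLE",
    jar_uri="s3://example-bucket/jar/webloganalysis.jar",
    main_class="ExampleDriver",
    output_uri="s3://example-bucket/emr/output/",
)

if __name__ == "__main__":
    print("submitted" if run_workflow(COMMANDS) else "no input file yet")
```

Passing the runner in as a parameter keeps the decision logic separate from the AWS calls, so the control flow can be exercised with a stand-in function before it is pointed at a real account.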

Page 14: Weblog Analysis

AWS Data Pipeline

Using an AWS Data Pipeline workflow to achieve the same objective: