Weblog Analysis on AWS EMR
Weblog Analysis Case Study
This case study processes a large volume of web logs to identify HTTP error codes returned from the server, which can help predict potential attacks. The HTTP return code can be found in the second-to-last column of each log line. We are going to extract the rows with an HTTP return code greater than 300.
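The core of the mapper is just field extraction and a comparison. The sketch below illustrates the idea outside of Hadoop; the class and method names are illustrative assumptions, not taken from the repository, which contains the actual MapReduce implementation.

```java
// Minimal sketch of the filtering logic, assuming Common Log Format input
// where the HTTP status code is the second-to-last whitespace-separated field.
public class WeblogFilter {

    // Extract the HTTP status code from one log line.
    static int statusCode(String logLine) {
        String[] fields = logLine.trim().split("\\s+");
        return Integer.parseInt(fields[fields.length - 2]);
    }

    // Keep only entries whose status code is greater than 300.
    static boolean isInvalidAccess(String logLine) {
        return statusCode(logLine) > 300;
    }

    public static void main(String[] args) {
        String line = "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "
                + "\"GET /images/NASA-logosmall.gif HTTP/1.0\" 304 0";
        System.out.println(statusCode(line));      // 304
        System.out.println(isInvalidAccess(line)); // true
    }
}
```

In the real job, a mapper would apply this predicate to each input line and emit only the matching records.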
Source Code
The source code can be downloaded from my GitHub repository:
https://github.com/yapweiyih/webloganalysis
Data Source
Input folders: web log and JAR file
Output Folders:
Cluster Configuration
Hardware Configuration
Specify the instance types and the number of instances for the cluster
Step Configuration
Specify the class name, the input file to be processed, and the output log directory
Create Cluster
Click ‘Create Cluster’ and wait for the setup to complete.
EC2 Dashboard
The cluster you created can also be viewed in your EC2 dashboard.
You can verify here whether the cluster has been terminated, to avoid unnecessary billing charges.
Output log
Three rows of output showing the web log details together with the extracted return code of 304:
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0	304
129.94.144.152 - - [01/Jul/1995:00:00:17 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0	304
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0	304
Depending on business requirements, the output can also be formulated to count the number of invalid accesses per domain name:
burger.letters.com 2
129.94.144.152 1
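In MapReduce terms, this variant has the mapper emit (domain, 1) for every invalid access and the reducer sum the counts per domain. A minimal in-memory sketch of that aggregation (class and method names are illustrative assumptions, not from the repository):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the per-domain count, assuming the first field of each log line
// is the domain and the second-to-last field is the HTTP status code.
public class InvalidAccessCounter {

    static Map<String, Integer> countInvalid(String[] logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            int status = Integer.parseInt(fields[fields.length - 2]);
            if (status > 300) {                       // invalid access
                counts.merge(fields[0], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logosmall.gif HTTP/1.0\" 304 0",
            "129.94.144.152 - - [01/Jul/1995:00:00:17 -0400] \"GET /images/ksclogo-medium.gif HTTP/1.0\" 304 0",
            "burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdown/liftoff.html HTTP/1.0\" 304 0"
        };
        countInvalid(lines).forEach((k, v) -> System.out.println(k + " " + v));
        // burger.letters.com 2
        // 129.94.144.152 1
    }
}
```

On a real cluster, the `counts.merge(...)` step is what the reducer performs across all mapper outputs for a given domain key.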
Debugging Log
The debugging log can be found on the AWS EMR console as well.
Alternative Dev Tools
If you do not want to spend money on AWS, you can also deploy your JAR and input data to the Hortonworks or Cloudera sandbox.
Hortonworks Distribution: http://hortonworks.com/products/hortonworks-sandbox/#install
Cloudera Distribution: http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html
Data Workflow Design
Assuming data from different sources only arrives at certain times, you can use the AWS CLI and schedule your tasks with a Linux cron job. Alternatively, you can make use of AWS Data Pipeline; however, it is currently not available in the Singapore region.
[Flowchart: data from Network, Social, and Sensor sources feeds the pipeline]
Start → Input file exists? → Start EC2 instance → Move file to EMR input directory → Process input files using EMR → End
AWS Data Pipeline
Using an AWS Data Pipeline workflow to achieve the same objective: