Clickstream Data
How to manage data with Hadoop
![Page 1: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/1.jpg)
Commercial Analytics of Clickstream Data using Hadoop
Submitted to: School of Mathematics and Computer Application Department, Thapar University, Patiala
Submitted by: Kartik Gupta, 201100048, M.C.A., Thapar University
June 2014
![Page 2: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/2.jpg)
Outline
- Overview
- Big Data
- Hadoop
- Major Steps
- Results and Analysis
- Conclusion and Future Scope
![Page 3: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/3.jpg)
Overview
- This project produces an analytic report on the behavior and location of website visitors using Hadoop.
- MapReduce is implemented to refine and sort the raw data.
- Searching is done by country, IP address, postal code, and category.
- Hadoop is a tool that converts unstructured, structured, and semi-structured data into key-value pairs, which are then reduced to a single value represented in binary format.
- The MapReduce framework is used for parallel implementation.
![Page 4: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/4.jpg)
Big Data
Big Data is a term used to describe large collections of data, possibly unstructured, that grow so large and so quickly that they are difficult to manage with regular database or statistical tools.
The 3 V's of Big Data: volume, velocity, and variety
![Page 5: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/5.jpg)
Hadoop
- Open source project started by Doug Cutting
- A platform to manage Big Data
- Helps in distributed computing
- Runs on commodity hardware
- Data storage (HDFS)
  - Runs on commodity hardware (usually Linux)
  - Horizontally scalable
- Processing (MapReduce)
  - Parallelized (scalable) processing
  - Fault tolerant
![Page 6: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/6.jpg)
CORE PARTS OF HADOOP
![Page 7: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/7.jpg)
Hadoop Distributed File System (HDFS)
- Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
- Some specific features ensure that Hadoop clusters are highly functional:
  - Rack awareness
  - Minimal data motion
  - Utilities
  - Rollback
  - Highly operable
![Page 8: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/8.jpg)
How HDFS works
![Page 9: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/9.jpg)
MapReduce
MapReduce is a programming model and an associated implementation for processing large data sets.
MapReduce usually splits the input data-set into independent chunks which are processed in a completely parallel manner.
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
The run-time system takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
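The model described on this slide can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample URLs are assumptions chosen for illustration, and on a real cluster the map and reduce calls would run in parallel on separate machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per input record: (url, 1)."""
    yield (record["url"], 1)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially over an in-memory input."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input chunk: one dictionary per page view.
records = [{"url": "/shoes"}, {"url": "/home"}, {"url": "/shoes"}]
hits = run_job(records)
print(hits)
```

Because each record is mapped independently and each key is reduced independently, both phases can be split across workers without changing the result.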
![Page 10: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/10.jpg)
Execution flow in MapReduce
1. The MapReduce program that has been written tells the JobClient to run a MapReduce job.
![Page 11: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/11.jpg)
Execution flow in MapReduce
2. This sends a message to the JobTracker, which produces a unique ID for the job.
![Page 12: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/12.jpg)
Execution flow in MapReduce
3. The JobClient copies job resources, such as the JAR file, to the distributed file system.
![Page 13: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/13.jpg)
Execution flow in MapReduce
4. Once the resources are in the distributed file system, the JobClient can tell the JobTracker to start the job.
![Page 14: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/14.jpg)
Execution flow in MapReduce
5. The JobTracker does its own initialization for the job. It retrieves the input splits from the distributed file system.
![Page 15: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/15.jpg)
Execution flow in MapReduce
6. Now that the JobTracker has work for the TaskTrackers, it returns a map task or reduce task in response to a heartbeat.
![Page 16: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/16.jpg)
Execution flow in MapReduce
7. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.
![Page 17: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/17.jpg)
Execution flow in MapReduce
8. The TaskTracker now runs the task.
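The eight steps of the execution flow can be mimicked with a toy, in-memory simulation. The classes below only borrow the names JobTracker and TaskTracker; they are not Hadoop's classes, and the job ID format, split names, and tracker names are all hypothetical.

```python
from collections import deque
import itertools


class JobTracker:
    """Toy stand-in for the JobTracker: issues job IDs and hands out tasks."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.pending = deque()

    def new_job_id(self):
        # Step 2: produce a unique ID for the job (format is made up).
        return f"job_{next(self._ids):04d}"

    def submit(self, job_id, input_splits):
        # Steps 4-5: initialise the job with one map task per input split.
        for split in input_splits:
            self.pending.append((job_id, "map", split))

    def heartbeat(self, tracker_name):
        # Step 6: answer a heartbeat with a task, or None when nothing is queued.
        return self.pending.popleft() if self.pending else None


class TaskTracker:
    """Toy stand-in for a TaskTracker: polls for work and records what it ran."""

    def __init__(self, name):
        self.name = name
        self.completed = []

    def poll(self, job_tracker):
        task = job_tracker.heartbeat(self.name)
        if task is not None:
            # Steps 7-8: fetching the code and running the task are simulated.
            self.completed.append(task)
        return task


job_tracker = JobTracker()
job_id = job_tracker.new_job_id()
job_tracker.submit(job_id, ["split-0", "split-1", "split-2"])

workers = [TaskTracker("tt-a"), TaskTracker("tt-b")]
while job_tracker.pending:
    for worker in workers:
        worker.poll(job_tracker)

for worker in workers:
    print(worker.name, worker.completed)
```

The point of the pull-based design is that TaskTrackers ask for work through heartbeats, so the JobTracker never has to push tasks to a node that may have failed.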
![Page 18: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/18.jpg)
OTHER TECHNOLOGICAL TERMS
Clickstream Data
- Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.
Potential Uses of Clickstream Data
- What is the most efficient path for a site visitor to research a product, and then buy it?
- What products do visitors tend to buy together, and what are they most likely to buy in the future?
- Where should I spend resources on fixing or enhancing the user experience on my website?
We focus on the “path optimization” use case. Specifically: how can we improve our website to reduce bounce rates and improve conversion?
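As a rough illustration of the bounce-rate metric mentioned on this slide, the sketch below counts the share of visitors who viewed exactly one page. It assumes each visitor ID corresponds to a single session, which is a simplification; the `bounce_rate` function and the sample data are hypothetical.

```python
from collections import Counter


def bounce_rate(page_views):
    """Return the fraction of visitors who viewed exactly one page.

    page_views is an iterable of (visitor_id, url) pairs.
    """
    views_per_visitor = Counter(visitor for visitor, _url in page_views)
    if not views_per_visitor:
        return 0.0
    bounces = sum(1 for count in views_per_visitor.values() if count == 1)
    return bounces / len(views_per_visitor)


# Hypothetical page views: u2 and u3 leave after a single page.
page_views = [
    ("u1", "/home"), ("u1", "/shoes"),
    ("u2", "/home"),
    ("u3", "/home"),
    ("u4", "/home"), ("u4", "/cart"),
]
rate = bounce_rate(page_views)
print(rate)
```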
![Page 19: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/19.jpg)
STEP I
Upload the Acme website log dataset, which contains about 4 million rows representing five days of clickstream data.
![Page 20: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/20.jpg)
STEP II
Display the dataset in its unstructured format, i.e. timestamp, registered user SWID, IP address, geocoded IP address, URL.
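A minimal sketch of turning such raw rows into records, assuming the five fields arrive tab-separated in the order listed on this slide. The delimiter, the `Click` type, and the sample row are assumptions; the actual Acme log layout is not shown in the slides.

```python
import csv
import io
from typing import NamedTuple


class Click(NamedTuple):
    timestamp: str
    swid: str
    ip: str
    geo: str
    url: str


def parse_log(text):
    """Parse tab-separated rows into Click records, skipping malformed rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Click(*row) for row in reader if len(row) == len(Click._fields)]


# Hypothetical sample: one well-formed row and one truncated row.
sample = (
    "2012-03-15 01:34:46\t{SWID-0001}\t203.0.113.7\tmiami,fl,usa\t"
    "http://www.acme.com/shoes\n"
    "truncated row\n"
)
clicks = parse_log(sample)
print(clicks)
```

Rows with the wrong number of fields are dropped rather than raising an error, since raw web logs often contain partial lines.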
![Page 21: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/21.jpg)
STEP III
Display the users data from the unstructured loaded dataset.
![Page 22: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/22.jpg)
STEP IV
Display the products, category-wise, from the dataset.
![Page 23: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/23.jpg)
STEP V
Display the refined dataset of the Acme log files.
![Page 24: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/24.jpg)
STEP VI
Combine all the tables, i.e. Acme log, products, and users.
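Combining the three tables amounts to a join. The sketch below shows one way to do it in plain Python, assuming the log joins to users on the SWID and to products on the URL. The join keys, column names, and sample rows are assumptions, since the slides do not show the table schemas.

```python
def join_tables(log_rows, users, products):
    """Left-join each log row with its user (by SWID) and product (by URL)."""
    joined = []
    for row in log_rows:
        user = users.get(row["swid"], {})
        product = products.get(row["url"], {})
        joined.append({
            **row,
            "gender": user.get("gender"),      # None when the user is unknown
            "age": user.get("age"),
            "category": product.get("category"),
        })
    return joined


# Hypothetical tables; "u9" has no matching user record.
log_rows = [
    {"swid": "u1", "ip": "10.0.0.1", "url": "/p/1"},
    {"swid": "u9", "ip": "10.0.0.2", "url": "/p/2"},
]
users = {"u1": {"gender": "F", "age": 31}}
products = {"/p/1": {"category": "shoes"}, "/p/2": {"category": "clothing"}}

combined = join_tables(log_rows, users, products)
print(combined)
```

A left join keeps every log row even when the visitor is not a registered user, so anonymous traffic still counts in later per-category totals.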
![Page 25: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/25.jpg)
Results and Analysis
Configuration of Hadoop
![Page 26: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/26.jpg)
Results and Analysis
Count the number of visitors from any given country
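A count like this can be sketched as a group-by over the combined rows. The example assumes a visitor is identified by a distinct IP address and that each row already carries a `country` field; both are assumptions, as is the sample data.

```python
from collections import defaultdict


def visitors_by_country(rows):
    """Count distinct visitor IP addresses per country."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["country"]].add(row["ip"])
    return {country: len(ips) for country, ips in seen.items()}


# Hypothetical rows; 10.0.0.1 appears twice but is counted once.
rows = [
    {"ip": "10.0.0.1", "country": "usa"},
    {"ip": "10.0.0.1", "country": "usa"},
    {"ip": "10.0.0.2", "country": "usa"},
    {"ip": "10.0.0.3", "country": "ind"},
]
counts = visitors_by_country(rows)
print(counts)
```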
![Page 27: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/27.jpg)
Results and Analysis
Retrieving the IP addresses and displaying the state of visitors
![Page 28: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/28.jpg)
Results and Analysis
Showing the number of IP addresses accessing a given category at a time
![Page 29: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/29.jpg)
Results and Analysis
Initial stage of mapping and reduction
![Page 30: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/30.jpg)
Results and Analysis
Total number of IP addresses accessing each category
![Page 31: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/31.jpg)
Results and Analysis
Showing the shoes category by state, with the total number of IP addresses accessing it
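A per-state breakdown for one category is a filter followed by a group-by. The sketch below assumes each row carries `category`, `state`, and `ip` fields; those names and the sample rows are hypothetical.

```python
from collections import defaultdict


def ips_by_state(rows, category):
    """Return (state, distinct IP count) pairs for one category, largest first."""
    seen = defaultdict(set)
    for row in rows:
        if row["category"] == category:
            seen[row["state"]].add(row["ip"])
    counts = {state: len(ips) for state, ips in seen.items()}
    # Sort by count descending, then by state name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rows; only the "shoes" rows are counted.
rows = [
    {"category": "shoes", "state": "fl", "ip": "10.0.0.1"},
    {"category": "shoes", "state": "fl", "ip": "10.0.0.2"},
    {"category": "shoes", "state": "ca", "ip": "10.0.0.3"},
    {"category": "shoes", "state": "ca", "ip": "10.0.0.3"},
    {"category": "clothing", "state": "ny", "ip": "10.0.0.4"},
]
shoes_by_state = ips_by_state(rows, "shoes")
print(shoes_by_state)
```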
![Page 32: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/32.jpg)
Results and Analysis
Showing details of visitors' IP addresses, gender-wise
![Page 33: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/33.jpg)
Results and Analysis
Number of female visitors who accessed this page
![Page 34: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/34.jpg)
Results and Analysis
Total number of IP addresses that accessed a particular webpage
![Page 35: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/35.jpg)
Results and Analysis
Calculate the sum of ages of all the visitors
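The slide does not say whether ages are summed per click or per visitor. The sketch below assumes per visitor, so a user with many clicks is counted once; the field names `swid` and `age` and the sample rows are hypothetical.

```python
def sum_of_ages(rows):
    """Sum ages once per visitor (SWID), ignoring rows with no known age."""
    ages = {}
    for row in rows:
        if row.get("age") is not None:
            ages[row["swid"]] = row["age"]
    return sum(ages.values())


# Hypothetical rows; u1 clicks twice and u3 has no age on record.
rows = [
    {"swid": "u1", "age": 31},
    {"swid": "u1", "age": 31},
    {"swid": "u2", "age": 24},
    {"swid": "u3", "age": None},
]
total_age = sum_of_ages(rows)
print(total_age)
```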
![Page 36: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/36.jpg)
Conclusion
The amount of clickstream data is growing rapidly, and with it the demand for accessing information over the web has increased significantly.
Therefore, this project analyzes the behavior and location of visitors.
It is inefficient to process large data using traditional sequential methods.
Therefore, MapReduce is used for processing large datasets.
![Page 37: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/37.jpg)
Future Scope
Clickstream information plays an important role in a wide variety of applications, such as decision support systems and profile-based marketing.
Location search is used by various industries, such as telecom and e-commerce, and in event detection.
The nearest-location method can be fused with other methods to better support decision making.
A tradeoff would then be made between distance and the other fused factors.
![Page 38: Clickstream Data](https://reader030.vdocuments.net/reader030/viewer/2022012905/56d6bf8c1a28ab301696aa9e/html5/thumbnails/38.jpg)
Thank you !!!