web mining

WEB MINING Presentation 1 CSE 590 DATA MINING Prof. Anita Wasilewska SUNY Stony Brook Presented By: Alka Simha 106677801 Avanthi Gupta 106616697 Megha Krishnamurthy 106616749


Page 1: Web Mining

WEB MINING Presentation 1

CSE 590 DATA MINING
Prof. Anita Wasilewska

SUNY Stony Brook

Presented By:
Alka Simha 106677801
Avanthi Gupta 106616697
Megha Krishnamurthy 106616749

Page 2: Web Mining

REFERENCES

• Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber
• Presentation Slides of Prof. Anita Wasilewska
• http://en.wikipedia.org/wiki/Web_mining
• http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf
• http://searchcrm.techtarget.com/sDefinition/0,,sid11_gci789009,00.html
• http://www.cs.rpi.edu/~youssefi/research/VWM/
• http://www.galeas.de/webimining.html
• R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.
• R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1(1):5-32, 1999.
• S. Chakrabarti. Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2):1-11, 2000.
• Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
• Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing bookmarks per contents. Proc. Fifth International World Wide Web Conference, May 6-10, 1996.

Page 3: Web Mining

OVERVIEW

• What is Web Mining
• Challenges in Web Mining
• Data Mining vs. Web Mining
• Classification or Taxonomy
• Applications of Web Mining
• Conclusion

Page 4: Web Mining

What is Web Mining

• The web, as we all know, is the single largest source of data available.

• Web mining aims to extract and mine useful knowledge from the web.

• It is used to understand customer behavior, evaluate the effectiveness of a website, and help quantify the success of a marketing campaign.

Page 5: Web Mining

• Because so much data is available on the World Wide Web, it has become very important for users to rely on automated tools to find the information resources they need.

• For example, a user searches with Google or Yahoo to find information.

• These factors thus give rise to the necessity of creating server and client side intelligent systems which can effectively mine for knowledge.

• The information gathered through the Web is further evaluated by using traditional data mining techniques such as clustering, classification and association.

Page 6: Web Mining

SEARCHING THE WEB

http://infolab.stanford.edu/~ullman/mining/2008/slides/web_mining_overview.pdf

Page 7: Web Mining

HOW BIG IS THE WEB

http://news.netcraft.com/archives/web_server_survey.html

224,749,695 (Mar 2009) Netcraft survey – total number of sites across all domains

Page 8: Web Mining

CHALLENGES IN WEB MINING

• Finding useful and relevant information.
• Creating knowledge from available information.
• As the coverage of information is very wide and diverse, personalization of the information is a tedious process.
• Learning customer and individual user patterns.
• Much of the web information is redundant, as the same piece of information or its variant appears in many pages.
• The web is noisy, i.e. a page typically contains a mixture of many kinds of information, such as main content, advertisements, copyright notices, and navigation panels.
• The web is dynamic: information keeps changing constantly. Keeping up with the changes and monitoring them are very important.

Page 9: Web Mining

• The Web is also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services.

• The most important challenge faced is invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially when this occurs without their knowledge or consent.

http://en.wikipedia.org/wiki/Web_mining

Page 10: Web Mining

USES OF WEB MINING

• This technology has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes.

• The predictive capability of mining applications can benefit society by identifying criminal activities.

• Companies can establish better customer relationships by giving customers exactly what they need.

• Companies can understand the needs of the customer better and can react to those needs faster.

• Companies can find, attract and retain customers, and they can save on production costs by using the insight gained into customer requirements.

• They can increase profitability through target pricing based on the profiles created.

• They can even identify a customer who might defect to a competitor. The company can then try to retain that customer with promotional offers, reducing the risk of losing them.

http://en.wikipedia.org/wiki/Web_mining

Page 11: Web Mining

WEB MINING vs DATA MINING

• STRUCTURE
  – Data Mining: Data is structured and has well-defined tables, columns, rows, keys and constraints.
  – Web Mining: Data is dynamic and rich in features and patterns.

• Web mining involves analysis of web server logs of a website whereas data mining involves using techniques to find relationships in large amounts of data.

• SPEED: Web mining often needs to react to evolving usage patterns in real time, e.g. merchandising.

http://www.information-management.com/news/5458-1.html

Page 12: Web Mining

WEB CRAWLERS

• A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders, and Web robots.

• Search engines use spidering as a means of providing up-to-date data.

• Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.

• Crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam). This is why addresses are often obfuscated, e.g. anita at cs dot sunysb dot edu ; mueller{remove this}@cs.sunysb.edu

• A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.
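
The seed-and-frontier loop described in the last bullet can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `fetch` function, the `PAGES` dictionary and the example.org URLs are hypothetical stand-ins so the sketch runs without network access, and politeness rules such as robots.txt are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited.

    `fetch` is a caller-supplied function (hypothetical here) that maps a
    URL to its HTML text, or returns None when the page is unavailable.
    """
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # avoids queueing the same URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Toy in-memory "web" (an assumption for illustration only).
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.org/b": '<a href="/c#top">C</a>',
}
```

Calling `crawl(["http://example.org/"], PAGES.get)` visits the three known pages in breadth-first order; the link to `/c` is queued on the frontier but skipped because the toy fetcher has no content for it.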

Page 13: Web Mining

April 21, 2009 Web Mining 13

Page 14: Web Mining

WEB MINING TAXONOMY

Web Mining

• Web Content Mining
  – Identify information within given web pages
  – Distinguish personal home pages from other web pages

• Web Structure Mining
  – Infer knowledge from the World-Wide Web organization and the links between references and referents in the Web

• Web Usage Mining
  – Also known as Web Log Mining
  – Extract interesting patterns and trends in web access logs

Page 15: Web Mining

WEB CONTENT MINING

• Discovery of useful information from web contents / data / documents
  – Web data contents: text, image, audio, video, metadata and hyperlinks

• Pre-processing data before web content mining: feature selection

• Post-processing data can reduce ambiguous search results

• Web Page Content Mining:
  – Mines the contents of documents directly

• Search Engine Mining:
  – Improves on the content search of other tools like search engines

• Web Content Mining is related to data mining and text mining
  – It is related to data mining because many data mining techniques can be applied in Web content mining
  – It is related to text mining because much of the web content is text

Page 16: Web Mining

Issues in Web Content Mining

• Developing intelligent tools for IR
  – Finding keywords for key phrases
  – Discovering grammatical rules and collocations
  – Hypertext classification/categorization
  – Extracting key phrases from text documents
  – Learning extraction models/rules
  – Hierarchical clustering
  – Predicting word relationships

• Developing Web query systems
  – WebOQL, XML-QL

• Mining multimedia data
  – Mining images from satellites (Fayyad, et al. 1996)
  – Mining images to identify small volcanoes on Venus (Smyth, et al. 1996)

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Page 17: Web Mining

WEB STRUCTURE MINING

• The structure of a typical Web graph consists of Web pages as nodes, and hyperlinks as edges connecting two related pages

• Web Structure Mining is the process of discovering structure information from the Web

• Finding information about the web pages and drawing inferences from hyperlinks

• Retrieving information about the relevance and the quality of the web page

• This type of mining can be performed either at the (intra-page) document level or at the (inter-page) hyperlink level

• Finding authoritative Web pages
  – Retrieving pages that are not only relevant but are also of high quality, or authoritative on the topic

Page 18: Web Mining

WEB STRUCTURE MINING

• Hyperlinks can be used to infer the notion of authority
  – The Web consists not only of pages, but also of hyperlinks pointing from one page to another
  – These hyperlinks contain an enormous amount of latent human annotation
  – A hyperlink pointing to another Web page can be considered the author's endorsement of that page

• To discover the link structure of the hyperlinks at the inter-document level and to generate a structural summary of the Website and Web page:
  – Based on the hyperlinks, categorizing the Web pages and the generated information
  – Discovering the structure of the Web document itself
  – Discovering the nature of the hierarchy or network of hyperlinks in the Website of a particular domain

• The research at the hyperlink level is also called Hyperlink Analysis

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
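
One widely used way to turn "a hyperlink is an endorsement" into a numeric authority score is the PageRank recurrence R(p) = ε/n + (1 − ε) · Σ R(q)/outdegree(q), summed over pages q that link to p. The sketch below is a minimal power-iteration version under stated assumptions: the `links` dictionary, the default ε = 0.15, the fixed iteration count, and the handling of pages without outgoing links are illustrative choices, not something taken from the slides.

```python
def pagerank(links, eps=0.15, iterations=50):
    """Iterates R(p) = eps/n + (1 - eps) * sum(R(q) / outdegree(q)).

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: eps / n for p in pages}
        for q in pages:
            targets = links.get(q, [])
            if targets:
                # q splits its rank evenly among the pages it endorses.
                share = (1 - eps) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
            else:
                # Dangling page: spread its rank over all pages
                # (an assumption; the recurrence leaves this case open).
                share = (1 - eps) * rank[q] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

For a toy graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, page `c` ends up ranked above `b` because it is endorsed by both `a` and `b`, while the ranks still sum to 1.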

Page 19: Web Mining

WEB USAGE MINING

• Web usage mining is also known as Web log mining

• What is Usage mining?
  – Discovering user ‘navigation patterns’ from web data
  – Predicting user behavior while the user interacts with the web
  – Helps to improve a large collection of resources

• Typical sources of data:
  – Automatically generated data stored in server access logs, referrer logs, agent logs and client-side cookies
  – User profiles
  – Metadata: page attributes, content attributes, usage data

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Page 20: Web Mining

WEB USAGE MINING

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Page 21: Web Mining

WEB USAGE MINING

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Page 22: Web Mining

WEB USAGE MINING

• Applications:
  – Target potential customers for electronic commerce
  – Enhance the quality and delivery of Internet information services to the end user
  – Improve Web server system performance
  – Identify potential prime advertisement locations
  – Facilitate personalization of sites
  – Improve site design
  – Fraud/intrusion detection
  – Predict user’s actions (allows pre-fetching)

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Page 23: Web Mining

Problems with Web Logs

• Typically a 30-minute timeout is used to decide when a user session has ended

• Web content may be dynamic
  – May not be able to reconstruct what the user saw

• Spiders and automated agents automatically request web pages

• Like most data mining tasks, web log mining requires preprocessing
  – To identify users
  – To match sessions to other data
  – To fill in missing data
  – Essentially, to reconstruct the click stream

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt
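
The 30-minute timeout is commonly applied as an inactivity gap: a user's requests are sorted by time, and a new session starts whenever two consecutive requests are more than 30 minutes apart. The following is a minimal sketch of that rule, assuming the log has already been reduced to `(user_id, timestamp, url)` tuples; the tuple layout and the function name are illustrative assumptions.

```python
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Splits each user's requests into sessions using an inactivity timeout.

    `requests` is an iterable of (user_id, timestamp, url) tuples.
    Returns {user_id: [[url, ...], [url, ...], ...]}.
    """
    by_user = {}
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        sessions = by_user.setdefault(user, [])
        # Compare with the timestamp of the last request in the open session.
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, url))
        else:
            sessions.append([(ts, url)])
    return {
        user: [[url for _ts, url in session] for session in sessions]
        for user, sessions in by_user.items()
    }


# Illustrative data only.
EXAMPLE_REQUESTS = [
    ("u1", datetime(2009, 4, 21, 10, 0), "/"),
    ("u2", datetime(2009, 4, 21, 10, 5), "/"),
    ("u1", datetime(2009, 4, 21, 10, 10), "/a"),
    ("u1", datetime(2009, 4, 21, 11, 0), "/b"),
]
```

With the example data, user `u1` gets two sessions because the request at 11:00 arrives 50 minutes after the previous one. Identifying the user in the first place is a separate problem that this sketch does not address.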

Page 24: Web Mining

Problems with Web Logs

• Identifying users
  – Clients may have multiple streams
  – Clients may access the web from multiple hosts
  – Proxy servers: many clients/one address
  – Proxy servers: one client/many addresses

• Data not in log
  – POST data (i.e., CGI request) not recorded
  – Cookie data stored elsewhere

• Other issues
  – When does a session end
  – Pages may be cached

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

Page 25: Web Mining

Web Log – Data Mining Applications

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30.ppt

• Association rules
  – Find pages that are often viewed together

• Clustering
  – Cluster users based on browsing patterns
  – Cluster pages based on content

• Classification
  – Relate user attributes to patterns
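
As a rough illustration of the association-rule item, the sketch below counts how often pairs of pages occur in the same session and reports rules "x → y" that meet a minimum support and confidence. It is a simplified pair-only version, not a full Apriori implementation; the function name, thresholds and example sessions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def page_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Returns sorted (x, y, support, confidence) tuples for rules x -> y.

    support    = fraction of sessions containing both x and y
    confidence = sessions with both x and y / sessions with x
    """
    n = len(sessions)
    page_sets = [set(s) for s in sessions]
    page_count = Counter(p for s in page_sets for p in s)
    pair_count = Counter(
        pair for s in page_sets for pair in combinations(sorted(s), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / page_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Illustrative sessions only.
EXAMPLE_SESSIONS = [["/", "/a"], ["/", "/a", "/b"], ["/", "/b"], ["/a"]]
```

On the example sessions, every session that contains `/b` also contains `/`, so the rule `/b → /` has confidence 1.0 with support 0.5, while the pair (`/a`, `/b`) is dropped for low support.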

Page 26: Web Mining

Web Logs

• Web servers have the ability to log all requests

• Web server log formats:
  – Most use the Common Log Format (CLF)
  – The newer Extended Log Format allows configuration of the log file

• Design of a Web Log Miner:
  – Web log is filtered to generate a relational database
  – A data cube is generated from the database
  – OLAP is used to drill down and roll up in the cube
  – OLAM is used for mining interesting knowledge

$$R(p) = \frac{\varepsilon}{n} + (1 - \varepsilon) \cdot \sum_{(q,p) \in G} \frac{R(q)}{\mathrm{outdegree}(q)}$$

[Figure: Web log miner pipeline. Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube → (3) OLAP → Sliced and diced cube → (4) Data Mining → Knowledge]
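
To make the first steps of this pipeline concrete, here is a minimal sketch that parses Common Log Format lines into records and then counts them along chosen dimensions, which is a very small stand-in for a data cube roll-up. The regular expression covers the standard seven CLF fields; the helper names `parse_clf` and `roll_up` and the sample lines are assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Parses one CLF line into a dict, or returns None if it does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    parts = rec["request"].split()
    rec["method"] = parts[0] if len(parts) >= 2 else None
    rec["path"] = parts[1] if len(parts) >= 2 else None
    return rec


def roll_up(records, *dims):
    """Counts records grouped by the chosen dimensions (e.g. "status")."""
    return Counter(tuple(rec[d] for d in dims) for rec in records)


# Illustrative log lines only.
SAMPLE_LINES = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 512',
    '10.0.0.7 - - [10/Oct/2000:14:02:10 -0700] "GET /missing HTTP/1.0" 404 -',
]
```

Parsing `SAMPLE_LINES` and calling `roll_up(records, "status")` gives per-status counts; grouping by `("host", "status")` instead is the drill-down direction. A real miner would load the cleaned records into a database and use an OLAP engine rather than in-memory counters.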

Page 27: Web Mining

Web Logs

http://mate.dm.uba.ar/~pfmislej/web%20mining/web%20mining.pdf

Page 28: Web Mining

WEB MINING APPLICATIONS

• Personalization, Recommendation engines

• Web-commerce applications

• Intelligent web search

• Hypertext classification and Categorization

• Information/trend monitoring

• Analysis of online communities

• Improving the relationship between the website and the user
  – Recommendations to modify the web site structure and content
  – Web personalization
  – Intelligent web sites: systems that “based on the user behavior, allow implementation of changes to the current web site structure and content”

paginas.fe.up.pt/~ec/files_0506/slides/06_WebMining.pdf

Page 29: Web Mining

http://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf

Personalization of Webpages

Page 30: Web Mining

CONCLUSION

• The Web has been adopted as a critical communication and information medium by a majority of the population.
• Web data is growing at a significant rate.
• A number of new Computer Science concepts and techniques have been developed.
• Many successful applications exist.
• Fertile area of research.
• Privacy: real debate needed.

Page 31: Web Mining

VISUAL WEB MINING Presentation 2

CSE 590 DATA MINING
Prof. Anita Wasilewska

SUNY Stony Brook

Presented By:
Alka Simha 106677801
Avanthi Gupta 106616697
Megha Krishnamurthy 106616749

Page 32: Web Mining

WWW2004, May 17–22, 2004, New York, New York, USA. ACM 1-58113-912-8/04/0005

Amir H. Youssefi (Rensselaer Polytechnic Institute)
David J. Duke (University of Bath)
Mohammed J. Zaki (Rensselaer Polytechnic Institute)

International World Wide Web Conference May 17 – 22, 2004

Visual Web Mining

Page 33: Web Mining

References

• http://www.cs.rpi.edu/~zaki/PS/WWW04.pdf
• http://www.cs.rpi.edu/~youssefi/research/VWM/
• http://www.vtk.org/
• http://www.w3.org/Robot/
• http://www.cs.rpi.edu

Page 34: Web Mining

Overview

• What is Visual Web Mining
• Abstract
• Introduction
• Visual Web Mining Architecture
• Visual Representation
• Design and Implementation of diagrams
• Conclusion

Page 35: Web Mining

What is Visual Web Mining

Application of information visualization techniques to the results of Web Mining in order to further amplify the perception of extracted patterns and visually explore new ones in the web domain.

http://www.cs.rpi.edu/~youssefi/research/VWM/

Page 36: Web Mining

Abstract

Analysis of web site usage data involves two significant challenges:

• Volume of data arising from the growth of the web.
• Structural complexity of web sites.

Page 37: Web Mining

In this paper

• Data Mining and Information Visualization techniques are applied to the web domain in order to benefit from the power of both human visual perception and computing.

• Data Mining techniques are applied to large web data sets, and Information Visualization methods are used on the results.

GOAL: To correlate the outcomes of mining Web Usage Logs and the extracted Web Structure by visually superimposing the results.

Page 38: Web Mining

Introduction

• Information Visualization

Visual representations of abstract data, using computer-supported, interactive visual interfaces to reinforce human cognition; thus enabling the viewer to gain knowledge about the internal structure of the data and relationships in it.

• Visual Web Mining Framework

Provides a prototype implementation for applying information visualization techniques to the results of Data Mining.

• User Session

Compact sequence of web accesses by a user.

Page 39: Web Mining

• Visualization is used in order to understand:
  – The structure of a particular website.
  – Web surfers’ behavior when visiting that website.

• Due to the large dataset and the structural complexity of the sites, 3D visual representations are used.

• Implemented using an open source toolkit called the Visualization Toolkit (VTK).
  – VTK consists of a C++ class library and several interpreted interface layers including Tcl/Tk, Java, and Python.

http://www.vtk.org/

Page 40: Web Mining

Visual Web Mining Architecture

Page 41: Web Mining

Visual Web Mining Architecture

• Input:
  – Web pages and Web server log files.
  – A web robot (webbot) is used to retrieve the pages of the website.
  – The webbot is a very fast Web walker with support for regular expressions, SQL logging facilities, and many other features. It can be used to check links, find bad HTML, map out a web site, download images, etc.

• In parallel, Web Server Log files are downloaded and processed through a sessionizer, and a LOGML file is generated.

• The Integration Engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming and integrating data, then loading it into a database and later generating graphs in XGML.

http://www.w3.org/Robot/

Page 42: Web Mining

Visual Web Mining Architecture

• User sessions are extracted from web logs, which yields results roughly related to a specific user.

• User sessions are then converted into a special format for Sequence Mining using cSPADE (constrained SPADE; SPADE stands for Sequential PAttern Discovery using Equivalence classes).

• Outputs:
  – Frequent contiguous sequences with a given minimum support.
  – These are imported into a database, and non-maximal frequent sequences are removed.
  – Different queries are executed against this data according to some criterion, e.g. support of each pattern, length of patterns, etc.
  – Different URLs which correspond to the same webpage are unified in the final results.

• The Visualization Stage: Maps the extracted data and attributes into visual images, realized through VTK extended with support for graphs.

• Result: Interactive 3D/2D visualizations which can be used by analysts to compare actual web surfing patterns with expected patterns.
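
To show what "frequent contiguous sequences with a given minimum support" and "non-maximal sequences are removed" mean in practice, here is a deliberately naive sketch that enumerates every contiguous run of pages in each session and counts the sessions containing it. It is not cSPADE, which uses a far more efficient search; the function names, the `max_len` cap and the example sessions are assumptions for illustration.

```python
from collections import Counter


def contiguous_subsequences(session, max_len):
    """Yields every contiguous run of 1..max_len pages in a session."""
    for length in range(1, max_len + 1):
        for i in range(len(session) - length + 1):
            yield tuple(session[i:i + length])


def frequent_contiguous(sessions, min_support, max_len=5):
    """Support = number of sessions that contain the sequence at least once."""
    counts = Counter()
    for session in sessions:
        # A set, so a sequence is counted once per session.
        counts.update(set(contiguous_subsequences(session, max_len)))
    return {seq: c for seq, c in counts.items() if c >= min_support}


def is_contiguous_sub(short, long):
    """True if `short` appears as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def maximal_only(frequent):
    """Drops sequences contained in a longer frequent sequence."""
    return {
        seq: c
        for seq, c in frequent.items()
        if not any(
            len(other) > len(seq) and is_contiguous_sub(seq, other)
            for other in frequent
        )
    }


# Illustrative sessions only.
EXAMPLE_SESSIONS = [["/", "/a", "/b"], ["/", "/a", "/c"], ["/x", "/", "/a"]]
```

With a minimum support of 2, the frequent sequences in the example are `("/",)`, `("/a",)` and `("/", "/a")`; after removing non-maximal ones only `("/", "/a")` remains.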

Page 43: Web Mining

Visual Representation

Structures:

– Graphs

Extract a spanning tree from the site structure, and use it as the framework for presenting access-related results through glyphs (graphical markers) and color mapping.

– Stream Tubes

Variable-width tubes showing access paths with different traffic are introduced on top of the web graph structure.
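
One simple way to obtain a spanning tree from a site's link graph is a breadth-first traversal from the home page, keeping only the first link that reaches each page. The slides do not say which spanning-tree method the framework uses, so the sketch below is an assumption for illustration; the `links` dictionary and page names are hypothetical.

```python
from collections import deque


def bfs_spanning_tree(links, root):
    """Returns {page: parent_page} for a breadth-first spanning tree.

    `links` maps each page to the list of pages it links to.
    The root maps to None; unreachable pages are left out.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:   # first link to reach a page wins
                parent[target] = page
                queue.append(target)
    return parent


# Illustrative site structure only.
EXAMPLE_LINKS = {
    "/": ["/faculty", "/courses"],
    "/faculty": ["/faculty/zaki", "/"],
    "/courses": ["/faculty/zaki", "/courses/cs1"],
}
```

In the example, `/faculty/zaki` is linked from both `/faculty` and `/courses`, but the tree keeps only the edge from `/faculty`, which is reached first; the dropped links are exactly the ones a tree layout does not have to draw.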

Page 44: Web Mining

Design and Implementation of Diagrams

2D visualization layout with Strahler Coloring applied on web usage logs

This is a visualization of the web graph of the Computer Science department of Rensselaer Polytechnic Institute (http://www.cs.rpi.edu). Edge colors are assigned from Strahler numbers, a numerical measure of branching complexity.

One can see user access paths scattering from the first page of the website (the node in the center) to clusters of web pages corresponding to faculty pages, course home pages, etc.
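
For a tree, the Strahler number follows a short recursive rule: a leaf has value 1; an inner node takes the largest value among its children, plus one if that largest value occurs at two or more children. The sketch below computes it for a tree given as a parent-to-children dictionary. How the framework maps these numbers to colors is not described in the slides, and the `children` layout and node names here are assumptions.

```python
def strahler(children, node):
    """Strahler number of `node` in a tree given as {node: [child, ...]}."""
    kids = children.get(node, [])
    if not kids:
        return 1  # leaves have Strahler number 1
    values = sorted((strahler(children, k) for k in kids), reverse=True)
    # Two or more children tie for the maximum: branching gets more complex.
    if len(values) >= 2 and values[0] == values[1]:
        return values[0] + 1
    return values[0]


# Illustrative trees only.
UNBALANCED = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
BALANCED = {"root": ["l", "r"], "l": ["l1", "l2"], "r": ["r1", "r2"]}
```

In `UNBALANCED` the root gets 2 (children valued 2 and 1), while in `BALANCED` both subtrees are valued 2, so the root gets 3. Heavily branching parts of a site therefore receive higher numbers than long chains of pages.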

Page 45: Web Mining

Adding a third dimension enables visualization of more information and clarifies user behavior in and between clusters. The center node of the circular basement is the first page of the web site, from which users scatter to different clusters of web pages. The color spectrum from Red (entry points into clusters) to Blue (exit points) clarifies the behavior of users.

The cylinder-like part of this figure is a visualization of the web usage of surfers as they browse a long HTML document.

3D visualization layout with Strahler Coloring applied on web usage logs

Page 46: Web Mining

Left: One can observe long user sessions as strings falling off. These are a special type of long session in which the user navigates a sequence of web pages that come one after the other, e.g., sections of a long document. In many cases, these were web pages with many nodes connected by Next/Up/Previous hyperlinks.

Right: An enlarged view of the same visualization.

Page 47: Web Mining

Frequent access patterns extracted by the web mining process are visualized as a white graph on top of an embedded and colorful graph of web usage.

Superimposition of Frequent Patterns extracted from Web Mining on top of Web Usage

Page 48: Web Mining

Similar to the last picture, with the addition of another attribute, i.e., the frequency of a pattern, which is rendered as the thickness of the white tubes. This helps in the analysis of results.

Thickness of the tubes represents frequency of found patterns

Page 49: Web Mining

Higher Order layout for clear visualization and easier analysis

Superimposition of Web Usage on top of Web Structure with a higher order layout. The top node is the first page of the website. The hierarchical output of the layout makes analysis easier.

Page 50: Web Mining

Left: Superimposition of website dynamics (colored) on top of its static structure (gray)

Right: Zoom view of the colored region with the layout of Web Usage taken from the Web Graph basement. The basement itself is removed for clarity

Page 51: Web Mining

Conclusion

- Using the visualizations, a web analyst can easily identify which parts of the website are cold parts with few hits and which parts are hot ones with many hits, and classify them accordingly.

This also paves the way for making exploratory changes to the website.

- For example, adding links from hot parts of the web site to cold parts and then extracting, visualizing and interpreting the changes in access patterns.

Page 52: Web Mining

SPADE OVERVIEW

• An algorithm based on Apriori for fast discovery of frequent sequences
• Needs three database scans in order to extract sequential patterns
• Given: a database of customer transactions, each of which has the following characteristics: sequence-id or customer-id, transaction-time, and the item involved in the transaction
• The aim is to obtain typical behaviors according to the user's viewpoint
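
SPADE is usually described in terms of a vertical database layout: for each item, an id-list of (sequence-id, time) pairs, with longer patterns obtained by joining id-lists rather than rescanning the data. The sketch below illustrates only that core idea for 2-sequences of the form "a then b". It is a simplified illustration, not the full algorithm (no equivalence classes or lattice search), and the function names and example transactions are assumptions.

```python
from collections import defaultdict


def build_idlists(transactions):
    """Builds {item: {(sequence_id, time), ...}} from (sid, time, item) rows."""
    idlists = defaultdict(set)
    for sid, time, item in transactions:
        idlists[item].add((sid, time))
    return idlists


def support(idlist):
    """Number of distinct sequences (customers) in which the pattern occurs."""
    return len({sid for sid, _time in idlist})


def temporal_join(idlist_a, idlist_b):
    """Id-list of the 2-sequence 'a then b'.

    Keeps each occurrence of b that comes after some occurrence of a
    in the same sequence.
    """
    first_a = {}
    for sid, t in idlist_a:
        first_a[sid] = min(t, first_a.get(sid, t))
    return {
        (sid, t) for sid, t in idlist_b if sid in first_a and t > first_a[sid]
    }


# Illustrative (sequence_id, time, item) rows only.
EXAMPLE_TRANSACTIONS = [
    (1, 1, "A"), (1, 2, "B"),
    (2, 1, "B"), (2, 2, "A"),
    (3, 1, "A"), (3, 3, "B"), (3, 4, "C"),
]
```

On the example rows, "A then B" occurs in sequences 1 and 3 (support 2), while "B then A" occurs only in sequence 2 (support 1). The support of a joined pattern is read straight off its id-list, without another pass over the transactions.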

Page 53: Web Mining

A user’s browsing access pattern is amplified by a different coloring.

Depending on the link structure of the underlying pages, we can see the vertical access patterns of a user drilling down the cluster, making a cylinder shape.

Also, users following links down a hierarchy of webpages make a cone shape, and users going up hierarchies, e.g., back to the main page of the website, make a funnel shape.

Amplification of a user session: Clickstream (Bottom Left) in drill-down cylinder, Cone Scatter (Top Right) and Funnel Backoff to main page of website (Top Right)