aula 6 webmining.ppt - paginas.fe.up.ptec/files_0910/slides/aula_6_webmining.pdf · tf: term...
TRANSCRIPT
Web MiningWeb MiningWeb MiningWeb Mining
Based on several presentations found on the web: Sh i Ull T i P dShapiro, Ullman, Terziyan, Pedersen ...
1
Wh t i W b Mi i ?Wh t i W b Mi i ?What is Web Mining?What is Web Mining?
Web mining is the use of data mining techniques to automatically discover and extract information automat cally d scover and extract nformat on from Web documents/services
(Et i i 1996 CACM 39(11))(Etzioni, 1996, CACM 39(11))
Web mining aims to discovery useful information or m g m y f f mknowledge from the Web hyperlink structure, page content and usage data.g
(Bing LIU 2007, Web Data Mining, Springer)
2
Wh t i W b Mi i ?Wh t i W b Mi i ?What is Web Mining?What is Web Mining?
Motivation / Opportunity
The WWW is huge, widely distributed, global information service d h f h f d centre and, therefore, constitutes a rich source for data mining
Intelligent Web Search
P li ti R d ti E i Personalization, Recommendation Engines
Web-commerce applications
Building the Semantic Web Building the Semantic Web
Web page classification and categorization
News classification and clustering News classification and clustering
Information / trend monitoring
Analysis of online communities
3
y
Web and mail spam filtering
Ab d d th it i iAb d d th it i iAbundance and authority crisisAbundance and authority crisis
Liberal and informal culture of content generation and dissemination
Redundancy and non-standard form and content
Millions of qualifying pages for most broad queriesM ll ons of qual fy ng pages for most broad quer es
Example: java or kayaking
N th it ti i f ti b t th li bilit f it No authoritative information about the reliability of a site
Little support for adapting to the background of specific users
Pages added continuously and average page changes in a few weeks
4
5
Diff t f “ l i l” D t Mi i ?Diff t f “ l i l” D t Mi i ?Different from “classical” Data Mining?Different from “classical” Data Mining?
The web is not a relation
Textual information + linkage structure
Usage data is huge and growing rapidly
Google’s usage logs are bigger than their web crawl
Data generated per day is comparable to largest conventional Data generated per day is comparable to largest conventional data warehouses
6
Si f th W bSi f th W bSize of the WebSize of the Web
Number of pages Number of pages
11.5 billion indexable pages (http://www.cs.uiowa.edu/~asignori/web-size/ www2005)
Technically, infinite
Because of dynamically generated content
Lots of duplication (30-40%)
Best estimate of “unique” static HTML pages comes from search i l iengine claims
Yahoo = claimed 19.2 billion in Aug 2005
Number of unique web sites
Netcraft survey says 98 million sites
7
O t b 2006 W b S SO t b 2006 W b S SOctober 2006 Web Server SurveyOctober 2006 Web Server Survey
8
http://news.netcraft.com/archives/web_server_survey.html
from from http://www.worldwidewebsize.com/http://www.worldwidewebsize.com/
9
A th t tim t th b iA th t tim t th b iAnother way to estimate the web sizeAnother way to estimate the web size
The number of web servers was estimated by samplingand testing random IP address numbers and determining the fraction of such tests that successfully located a web server
The estimate of the average number of pages per server was obtained by crawling a sample of the serversserver was obtained by crawling a sample of the serversidentified in the first experiment
Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web Nature 400(6740): 107–109
10
web. Nature, 400(6740): 107–109.
Web Information RetrievalWeb Information Retrievalf mf m
According to most predictions, the majority of human informationwill be available on the Web in ten??? years
Effective information retrieval can aid in Research: Find all papers about web mining Research: Find all papers about web mining
Health/ Medicine : What could be reason for symptoms of “yellow eyes”, high fever and frequent vomiting
Travel: Find information on the tropical island of St. Lucia
Business: Find companies that manufacture digital signal processors
Entertainment: Find all movies starring Marilyn Monroe during the years 1960 and 1970
Arts: Find all short stories written by Jhumpa Lahiri
11
Arts: Find all short stories written by Jhumpa Lahiri
Why is Web Information Retrieval Difficult?Why is Web Information Retrieval Difficult?Why is Web Information Retrieval Difficult?Why is Web Information Retrieval Difficult?
The Abundance Problem (99% of information of no interest to 99% The Abundance Problem (99% of information of no interest to 99% of people) Hundreds of irrelevant documents returned in response to a search p
query
Limited Coverage of the Web (Internet sources hidden behind search interfaces)search interfaces) Largest crawlers cover less than 18% of Web pages
The Web is extremely dynamic The Web is extremely dynamic Lots of pages added, removed and changed every day
Very high dimensionality (thousands of dimensions) Very high dimensionality (thousands of dimensions)
Limited query interface based on keyword-oriented search
Limited cust mizati n t individual users
12
Limited customization to individual users
Search LandscapeSearch Landscape 2005pp
Sept 2009Sept 2009
13http://marketshare.hitslink.com/search-engine-market-share.aspx?qprid=4
S h E i W b C O lS h E i W b C O lSearch Engine Web Coverage OverlapSearch Engine Web Coverage Overlap
4 searches were d f d h defined that returned 141 web pages.
14
http://www.searchengineshowdown.com/stats/overlap.shtml
W b h b iW b h b iWeb search basicsWeb search basics
Sponsored Links
CG Appliance Express Discount Appliances (650) 756-3931 Same Day Certified Installation www.cgappliance.com San Francisco-Oakland-San Jose, CA Miele Vacuum Cleaners Miele Vacuums- Complete Selection Free Shipping! www.vacuums.com Miele Vacuum Cleaners Miele-Free Air shipping!
User
Web Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)
Miele, Inc -- Anything else is a compromise At the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ... www.miele.com/ - 20k - Cached - Similar pages
Miele Welcome to Miele, the home of the very best appliances and kitchens in the world. www.miele.co.uk/ - 3k - Cached - Similar pages
Miele - Deutscher Hersteller von Einbaugeräten, Hausgeräten ... - [ Translate this ]
pp gAll models. Helpful advice. www.best-vacuum.com
Web crawlerpage ] Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit ...ein Leben lang. ... Wählen Sie die Miele Vertretung Ihres Landes. www.miele.de/ - 10k - Cached - Similar pages
Herzlich willkommen bei Miele Österreich - [ Translate this page ] Herzlich willkommen bei Miele Österreich Wenn Sie nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE ... www.miele.at/ - 3k - Cached - Similar pages
IndexerSearch
The Web
15Ad indexesIndexes
W b C li B iW b C li B iWeb Crawling BasicsWeb Crawling Basics
Start with a “seed set” of to-visit urls
get next urlto visit urls
get pagevisited urlsWeb
extract urls
visited urlseb
web pages
16
C li IC li ICrawling IssuesCrawling Issues
L ad n web servers Load on web servers
E.g., no more than 1 request to the same server every 10 seconds
Insufficient resources to crawl entire web
Visit “important” pages first (pagerank, inlinks …)
How to keep crawled pages “fresh”?
How often do web pages change? What do we mean by freshness?p g g y
Detecting replicated content e.g., mirrors
Use document comparison techniques (java manuals) Use document comparison techniques (java manuals)
Can’t crawl the web from one machine
P ll li i th l
17
Parallelizing the crawl
W b Ad ti iW b Ad ti iWeb AdvertisingWeb Advertising Banner ads (1995-2001) Banner ads (1995 2001)
Initial form of web advertising
P p l bsit s h d X$ f 1000 “imp ssi ns” f d Popular websites charged X$ for every 1000 impressions” of ad
Modeled similar to TV, magazine ads
L li kth u h t s Low clickthrough rates
low ROI for advertisers
I t d d b O t d 2000 Introduced by Overture around 2000
Advertisers “bid” on search keywords
When someone searches for that keyword, the highest bidder’s ad is shown
Advertiser is charged only if the ad is clicked on
18
Advertiser is charged only if the ad is clicked on
W b Ad ti iW b Ad ti iWeb AdvertisingWeb Advertising
Search advertising is the revenue model
Multi-billion-dollar industry Multi-billion-dollar industry
Advertisers pay for clicks on their ads
Interesting problems
What ads to show for a search? What ads to show for a search?
Maximise revenue, each advertiser has a limited budget
If I’m an advertiser, which search terms should I bid on and how much to bid?
19
Web Mining TaxonomyWeb Mining TaxonomyWeb Mining TaxonomyWeb Mining Taxonomy
Web Mining
Web S
Web C Web UsageStructure
MiningContentMining
Web UsageMining
20
Web Mining TaxonomyWeb Mining TaxonomyWeb Mining TaxonomyWeb Mining Taxonomy Web content mining: focuses on techniques for
assisting a user in finding documents that meet a certain criterion
Web structure mining: aims at developing techniques to Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality which is available in the form of hyperlinksquality which is available in the form of hyperlinks
Web usage mining: focuses on techniques to study the user behaviour when navigating the web(also known as Web log mining and clickstream analysis)
21
W b C t t Mi iW b C t t Mi iWeb Content MiningWeb Content Mining
Examines the content of web pages as well as results of web searching.
22
W b C t t MiW b C t t MiWeb Content MinngWeb Content Minng
Can be thought of as extending the work performed by basic search engines
Search engines have crawlers to search the web and ggather information, indexing techniques to store the information, and query processing support to provide q y p g pp pinformation to the users
Web Content Mining is: the process of extracting knowledge from web contents
23
g
I f m ti R t i lI f m ti R t i lInformation RetrievalInformation Retrieval
Given:
A source of textual
Documentssource
documents
A user query (text based) IRQueryq y ( ) IRSystem
Fi d Find:
A set (ranked) of documents Document
DocumentD tthat are relevant to the
queryRanked
DocumentsDocument
24
S miS mi St t d D tSt t d D tSemiSemi--Structured DataStructured Data
Text content is in general semi structured Text content is, in general, semi-structured
Example:
Title
A th Author
Publication_Date Structured attribute/value pairs
Length
Category Category
AbstractUnstructured
25
Content
Structuring Textual InformationStructuring Textual InformationStructuring Textual InformationStructuring Textual Information
Many methods designed to analyze structured data
If we can represent documents by a set of attributes we will be able to use existing data mining methods
How to represent a document?
Vector based representation p
(referred to as “bag of words” as it is invariant to permutations)
Use statistics to add a numerical dimension to unstructured text
Term frequencyf q y
Document frequency
Term proximity
26Document length
Term proximity
Document RepresentationDocument RepresentationDocument RepresentationDocument Representation
A document representation aims to capture what the document m p m p mis about
One possible approach (boolean representation):
Each entry describes a document
Attribute describe whether or not a term appears in the ddocument
Example Terms
Camera Digital Memory Pixel …
Document 1 1 1 0 1Document 1 1 1 0 1
Document 2 1 1 0 0
27
… … … … …
Document RepresentationDocument RepresentationDocument RepresentationDocument Representation
Another approach:
Each entry describes a document
Attributes represent the frequency in which a term appears in the document
Example: Term frequency table
Terms
Camera Digital Memory Print …
Document 1 3 2 0 1
Document 2 0 4 0 3
28… … … … …
Document RepresentationDocument RepresentationDocument RepresentationDocument Representation
But a term is mentioned more times in longer documents
Th f l ti f (% f d t) Therefore, use relative frequency (% of document):
No. of occurrences/No. of words in document
Terms
Camera Digital Memory Print …
Document 1 0.03 0.02 0 0.01
Document 2 0 0.004 0 0.003
29… … … … …
More on Document RepresentationMore on Document RepresentationMore on Document RepresentationMore on Document Representation
Stop Word removal: Many words are not informative and thus Stop Word removal: Many words are not informative and thus irrelevant for document representation
the, and, a, an, is, of, that, …th , an , a, an, s, of, that, …
Stemming: reducing words to their root form (Reduce dimensionality)
A document may contain several occurrences of words like A document may contain several occurrences of words like
fish, fishes, fisher, and fishers
But would not be retrieved by a query with the keyword y q y y
fishing
Different words share the same word stem and should be represented with its stem, instead of the actual word
Fish
30
For the Portuguese language these techniques are less studied
Weighting Scheme for Term FrequenciesWeighting Scheme for Term FrequenciesWeighting Scheme for Term FrequenciesWeighting Scheme for Term Frequencies TF-IDF weighting: give higher weight to terms that are rareTF IDF weighting give higher weight to terms that are rare
TF: term frequency (increases weight of frequent terms)
If a term is frequent in lots of documents it does not have discriminative power If a term is frequent in lots of documents it does not have discriminative power
IDF: inverse term frequency
ijij
ij
dwndw
document in of soccurrence of number the is document and term given a For
i
ijij d
nTF
i
ijij
nd
w
documents of number the is document in words of number the is d
umfufum
i
nn
IDF jj log
jj w n contain that documents of number the is jijij IDFTFx
31
There is no compelling motivation for this method but it has been shown to be superior to other methods
Locating Relevant DocumentsLocating Relevant DocumentsLocating Relevant DocumentsLocating Relevant Documents
Given a set of keywords
Use similarity/distance measure to find similar/relevant documents
Rank documents by their relevance/similarity
H t d t i if t d t i il ?How to determine if two documents are similar?
32
Di t B d M t hiDi t B d M t hi In order retrieve documents similar to a given document we need a
Distance Based MatchingDistance Based Matching
measure of similarity
Euclidean distance (example of a metric distance):
The Euclidean distance between
X=(x1, x2, x3,…xn) and Y =(y1,y2, y3,…yn) X (x1, x2, x3,…xn) and Y (y1,y2, y3,…yn)
is defined as:
n B C
n
iii yxYXD
1
2)(),(B C
AProperties of a metric distance:
• D(X,X)=0• D(X Y)=D(Y X)
33
D• D(X,Y)=D(Y,X)• D(X,Z)+D(Z,Y) ≥ D(X,Y)
Angle Based MatchingAngle Based MatchingAngle Based MatchingAngle Based Matching Cosine of the angle between the vectors representing the document
and the query
Documents “in the same direction” are closely related.y
Transforms the angular measure into a measure ranging from 1 for the highest similarity to 0 for the lowestthe highest similarity to 0 for the lowest
B CT
T
YXYXYXYXD ),cos(),(
A
D
22
ii
ii
yxyx
34
D
Performance MeasurePerformance Measure
The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
The quality of a collection can be compared by the two following measures
}{}{}{
RetrievedRetrievedRelevant
precision
percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)
}{}{}{
}{
RelevantRetrievedRelevant
recall
Retrieved
percentage of documents that are relevant to the query and were, in fact, retrieved
}{
Retrieveddocuments
Relevantdocuments
Relevant &retrieved
35
All documents
I t lli t W b S hI t lli t W b S hIntelligent Web SearchIntelligent Web Search
Combine the intelligent IR tools
meaning of words meaning of words
order of words in the query
authority of the source
With the unique web features
retrieve Hyper-link information
utilize Hyper-link as input
36
yp p
Text MininText MininText MiningText Mining Data mining in text: find something useful and surprising from a
text collection;t t i i i f ti t i l text mining vs. information retrieval;
data mining vs. database queries.
D t l ifi ti Document classification Topic hierarchies, spam filters
l Document clustering cluster documents by a common author
cluster documents containing information from a common source (fraud)
Key-word based association rules
37
Key-word based association rules
Clustering and automatic automatic clusters’ labeling
38http://clusty.com
39 40
Web Structure MiningWeb Structure Mining
Exploiting Hyperlink Structure
Social network analysis
41
Fi t ti f h iFi t ti f h iFirst generation of search enginesFirst generation of search engines
E l d k d b d h Early days: keyword based searches
Keywords: “web mining”
Retrieves documents with “web” and mining”
L t ith Later on: cope with
synonymy problem
polysemy problem
stop words stop words
Common characteristic: Only information on the
42
pages is used
M d h i M d h i Modern search engines Modern search engines
Link structure is very important
Adding a link: deliberate act Adding a link: deliberate act
Harder to fool systems using in-links
Link is a “quality mark”
A page is important if important pages link to itp g p p p g
M d h l k Modern search engines use link structure as important source of information
43
C t l Q tiCentral Question:
Which useful information can be Which useful information can be Wh ch u fu nf rmat n can Wh ch u fu nf rmat n can derived derived
f th li k t t f th b?f th li k t t f th b?from the link structure of the web?from the link structure of the web?
44
Some answersSome answersSome answersSome answers
1. Structure of Internet1. Structure of Internet
2. Google2. Google
3. HITS: Hubs and Authorities3. HITS Hubs and Authorities
45
1 Th W b St t 1 Th W b St t 1. The Web Structure 1. The Web Structure
A study was conducted on a graph inferred from two large Altavista crawls.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., andWiener, J. (2000). Graph structure in the web. In Proc. WWW Conference.
Th st d fi m d th h p th sis th t th mb f The study confirmed the hypothesis that the number of in-links and out-links to a page approximately follows a Zipf distribution (a particular case of a power law)Zipf distribution (a particular case of a power-law)
46
P LP LPower LawsPower Laws
47
II Li kLi kInIn--LinksLinks
48
O tO t Li kLi kOutOut--LinksLinks
49
Th W b St t Th W b St t The Web Structure The Web Structure
If the web is treated as an undirected graph
90% of the pages form a single connected componentcomponent
If the web is treated as a directed graphIf the web is treated as a directed graph
four distinct components are identified, the four p ,with similar size
50
General TopologyGeneral Topology
TendrilsTendrils44mil
SCCIN OUTSCCIN44mil 44mil56mil
DisconnectedTubes Disconnected components
Tubes
SCC: set of pages that can be reached by one anotherIN: pages that have a path to SCC but not from it
51
IN: pages that have a path to SCC but not from itOUT: pages that can be reached by SCC but not reach itTENDRILS: pages that cannot reach and be reached the SCC pages
Some statisticsSome statistics Only between 25% of the pages there is a connecting path
BUT
If there is a path: If there is a path:
Directed: average length <17
U di t d l th 7 (!!!) Undirected: average length <7 (!!!)
It’s a “small world” -> between two people only chain of length 6!(http://en.wikipedia.org/wiki/Small_world_phenomenon)
Small World Graphs
High number of relatively small cliques
Small diameter
52
Internet (SCC) is a small world graph
53
GoogleGoogleGoogleGoogle Search engine that uses link structure to calculate a g
quality ranking (PageRank) for each page
I t iti P R k b th b bilit th t Intuition: PageRank can be seen as the probability that a “random surfer” visits a page Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc. WWW Conference, pages 107–117
A page is important if important pages link to it
54http://paspespuyas.com/comunidad/media/pageRank.gif
G lG lGoogleGoogle
Keywords w entered by user
Select pages containing w and pages which have in-links Select pages containing w and pages which have in links with caption w Anch r t xt Anchor text
Provides more accurate descriptions of Web pages
Anchors exist for un-indexable documents (e.g., images)
Font sizes of words in text:
Words in larger or bolder font are assigned higher weights
Rank pages according to importance
55
p g g mp
PageRankPageRankPageRankPageRank(P R nk) + (W bsit C nt nt) Ov r ll R nk in R sults
Page RankPage Rank:: A page is important if many important pages link to it.
(PageRank) + (Website Content) = Overall Rank in Results
Link ij :
ag anag an p g mp f m y mp p g .
j i considers j important. the more important i, the more important j becomes. if i has many out-links: links are less important.
Initially: all importances pi = 1. Iteratively, pi is refined.
56
ji iOutDegree
iPageRankppjPageRank)(
)()()( 1
PageRankPageRank
Let OutDegreei = # out-links of page i
Adjust pj:j pj
ji iOutDegree
iPageRankppjPageRank)(
)()()( 1
h h h h h
ji g )(
This is the weighted sum of the importance of the pages referring to PjParameter p is probability that the surfer gets bored and starts on a new random pagep g(1-p) is the probability that the random surfer follows a link on current page
57
P R kP R kPageRankPageRank
58Repeat until pagerank vector converges…
HITS HITS (Hyperlink(Hyperlink--Induced Topic Search)Induced Topic Search)HITS HITS (Hyperlink(Hyperlink Induced Topic Search)Induced Topic Search)
HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.
Premise: Sufficiently broad topics contain communities consisting of two types of hyperlinked pages:g yp yp p g
Authorities: highly-referenced pages on a topic
H b th t “ i t” t th iti Hubs: pages that “point” to authorities
A good authority is pointed to by many good hubs; a good hub d h
59
points to many good authorities
HubsHubsHubsHubs
Pages that link to a collection of authoritative pages on a broad topicpages point to interesting links to authorities = relevant pages
60
AuthoritiesAuthoritiesAuthoritiesAuthorities
Relevant pages of the highest quality on a broad topicRelevant pages of the highest quality on a broad topic
61
HITSHITS Steps for Discovering Hubs and Authorities on a
specific topicspecific topic Collect seed set of pages S (returned by search engine)
Expand seed set to contain pages that point to or are pointedto by pages in seed set (removes links inside a site)
Iteratively update hub weight h(p) and authority weight a(p) for each page:
qppq
qaphqhpa )()()()(
After a fixed number of iterations, pages with highest hub/authority weights form core of community
62
hub/authority weights form core of community
St th d k f HITS St th d k f HITS Strengths and weaknesses of HITS Strengths and weaknesses of HITS
Strength: its ability to rank pages according to the Strength: its ability to rank pages according to the query topic, which may be able to provide more relevant authority and hub pages relevant authority and hub pages.
Weaknesses:
It is easily spammed. It is in fact quite easy to influence HITS since adding out-links in one’s own page is so easy. g p g y
Topic drift. Many pages in the expanded set may not be on topic. p
Inefficiency at query time: The query time evaluation is slow. Collecting the root set, expanding it and performing
63
slow. Collecting the root set, expanding it and performing eigenvector computation are all expensive operations
Web Usage MiningWeb Usage Miningg gg g
l i b i tianalyzing user web navigation
64
Web Usage MiningWeb Usage MiningWeb Usage MiningWeb Usage Mining
Pages contain information
Links are “roads” Links are roads
How do people navigate over the Internet?
Web usage mining (Clickstream Analysis)
Information on navigation paths is available in log files Information on navigation paths is available in log files.
Logs can be examined from either a client or a server
65
perspective.
W b it U A l iW b it U A l iWebsite Usage AnalysisWebsite Usage Analysis
Why analyze Website usa e? Why analyze Website usage?
Knowledge about how visitors use Website could
Provide guidelines to web site reorganization; Help prevent disorientation
Help designers place important information where the visitors look for it
Pre-fetching and caching web pages
Provide adaptive Website (Personalization)
Questions which could be answered
What are the differences in usage and access patterns among users?
What user behaviours change over time?
How usage patterns change with quality of service (slow/fast)?
66
What is the distribution of network traffic over time?
Website Usage AnalysisWebsite Usage Analysis
67
Data SourcesData SourcesData SourcesData Sources
68
D t SD t SData SourcesData Sources
Server level collection: the server stores data regarding requests performed by the client, thus data regard generally just one source;
Client level collection: it is the client itself which sends to a f wrepository information regarding the user's behaviour (can be implemented by using a remote agent (such as Javascripts or Java applets) or by modifying the source code of an existing browser pp ) y y g g(such as Mosaic or Mozilla) to enhance its data collection capabilities. );
Proxy level collection: information is stored at the proxy side, thus Web data regards several Websites, but only users whose
69
g , yWeb clients pass through the proxy.
W b U Mi i PW b U Mi i PWeb Usage Mining ProcessWeb Usage Mining Process
WebSServer
Log DataPreparation Clean Data
Mi iPreparation Data Mining
SitSiteData Usage
Patterns
70
D t P tiD t P tiData PreparationData Preparation Data cleaningData cleaning
By checking the suffix of the URL name, for example, all log entries with filename suffixes such as, gif, jpeg, etc, g , jp g,
User identification
If p is st d th t is n t di tl link d t th p i s p s If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine
Other heuristics involve using a combination of IP address machine Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users
Transaction identification Transaction identification
All of the page references made by a user during a single visit to a site
Si f t ti f i l f t ll f
71
Size of a transaction can range from a single page reference to all of the page references
A l A l W b L Fil A lW b L Fil A lAnalog Analog –– Web Log File AnalyserWeb Log File Analyserhttp://www.analog.cx/
Gives basic statistics such as number of hits number of hits average hits per time period what are the popular pages in your site what are the popular pages in your site who is visiting your site what keywords are users searching for to get to you what keywords are users searching for to get to you what is being downloaded
72
W b U Mi iW b U Mi iWeb Usage MiningWeb Usage Mining
Commonly used approaches
Preprocessing data and adapting existing data mining techniques
For example associatin rules: does not take into account the order of the page requestsorder of the page requests
Developing novel data mining modelsp g g
73
Data Mining on Web TransactionsData Mining on Web TransactionsData M n ng on W ransact onsData M n ng on W ransact ons
Association Rules: discovers similarity among sets of items across transactions
X =====> Y
where X, Y are sets of items, confidence or P(X v Y),
support or P(X^Y)
Examples: 60% of clients who accessed /products/, also accessed
/products/software/webminer.htm. 30% of clients who accessed /special-offer.html, placed an online order in
/products/software/.
(Actual Example from IBM official Olympics Site)
{Badminton Diving} ===> {Table Tennis} (69 7% 35%)
74
{Badminton, Diving} ===> {Table Tennis} (69.7%,.35%)
Other Data Mining TechniquesOther Data Mining Techniques
Sequential Patterns:
30% of clients who visited /products/software/, had done a search in Yahoousing the keyword “software” before their visit
60% of clients who placed an online order for WEBMINER, placed another online order for software within 15 days
Clustering and Classification Clustering and Classification
clients who often access /products/software/webminer.html tend to be from educational institutions.from educational institutions.
clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.
75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.
75
Path and Usage Pattern DiscoveryPath and Usage Pattern Discoveryath an Usag att rn D sco ryath an Usag att rn D sco ry
Types of Path/Usage Information
Most Frequent paths traversed by users
Entry and Exit Points
Distribution of user session duration
Examples: Examples:
60% of clients who accessed /home/products/file1.html, followed the path /home ==> /home/whatsnew ==> /home/productsfollowed the path /home > /home/whatsnew > /home/products==> /home/products/file1.html
(Olympics Web site) 30% of clients who accessed sport specific pages( y p ) p p p gstarted from the Sneakpeek page.
65% of clients left the site after 4 or less references.
76
A Fi l Q tiA Fi l Q tiA Final QuestionA Final Question
Google Tools: search engine, Google news, Google maps, Gmail, Calendar, Google docs, YouTube, Orkut, Google Desktop, Chrome, Android...
Do You Trust Google to Resist Data Mining Across Services?
http://www readwriteweb com/archives/do you trust http://www.readwriteweb.com/archives/do_you_trust_google_to_resist_data_mining_across_services.php
77
S mmS mmSummarySummary
Web is huge and dynamic
Web mining makes use of data mining techniques to Web mining makes use of data mining techniques to automatically discover and extract information from Web documents/services Web documents/services
Web content mining
Web structure mining
Web usage mining Web usage mining
78
R fR fReferencesReferences
Data Mining: Introductory and Advanced Topics, Margaret Dunham (Prentice Hall, 2002)
Mining the Web - Discovering Knowledge from Hypertext Data Soumen Chakrabarti MorganHypertext Data, Soumen Chakrabarti, Morgan-Kaufmann Publishers
79
Thank you !!!Thank you !!!80
Thank you !!!Thank you !!!