mandar mitra cvpr unit indian statistical institute kolkata
TRANSCRIPT
Web Mining: An Overview
Mandar Mitra
CVPR Unit
Indian Statistical Institute
Kolkata
Web Mining: An Overview – p. 1/45
Overview
What is Web mining?
Classification of Web mining tasks
Challenges
Web content mining
Web structure mining
Web usage mining
References
Web Mining: An Overview – p. 2/45
What is Web Mining?
Web mining is the automatic discovery and extraction ofpotentially useful and previously unknown information fromWeb dataOld wine in a new bottle?
Web mining = databases + information retrieval + artificialintelligence (natural language processing, machinelearning) + . . .
So, why the interest?multidisciplinary naturegrowth of Web information sourcese-commerce potential: “Electronic commerce isemerging as the killer domain for data-miningtechnology”
Web Mining: An Overview – p. 3/45
Web Mining Tasks
Content mining: mine the content of documents/pagesretrieval, clustering of search results, filtering,summarization, classification / categorization, etc.
Structure mining: study the link structure of pages and sitesauthorities and hubs, page ranking (Google), detection ofcommunities
Usage mining: analyze usage data, surfingbehaviour/patterns
site restructuring, marketing
Compartments are not water-tightsearching, filtering (content-based / collaborative /reputation-based)
Web Mining: An Overview – p. 4/45
Challenges
Unstructured and heterogeneousMultimediaSize + rapid growth
1 new server every 2 hours5 million documents in 1995 to 320 million documents in1998
DynamicNetworked/distributed
Web Mining: An Overview – p. 5/45
Content Mining
Types of data: text, images, audio, video, databasesText is most important
Unstructured – free textSemi-structured – HTML documentsStructured – tables, documents generated from databases
Web Mining: An Overview – p. 6/45
Text Mining: Outline
IndexingSearchingFilteringWord relationshipsClassificationDiscovering document relationshipsSummarization
Web Mining: An Overview – p. 7/45
Text Mining: Indexing
Any text item (“document”) represented as list of terms andassociated weights
D = (〈t1, w1〉, . . . , 〈tn, wn〉)Term = keywords or content-descriptorsWeight = measure of the importance of a term inrepresenting the information contained in the document
Web Mining: An Overview – p. 8/45
Text Mining: Indexing
Tokenize: identify individual words
Stopword removal: eliminate common words, e.g. and, of,the, etc.
Stemming: reduce words to a common roote.g. analysis, analyze, analyzing→ analyuse standard algorithms (Porter)
Thesaurus: find synonyms for words in the document
Phrases: find multi-word terms e.g. computer science, datamining
use syntax/linguistic methods or “statistical” methods
Web Mining: An Overview – p. 9/45
Indexing: Term Weights
Term frequency (tf): repeated words are strongly related tocontentInverse document frequency (idf): uncommon term is moreimportantNormalization by document length
long docs. contain many distinct wordslong docs. contain same word many timesterm-weights for long documents should be reduceduse # bytes, # distinct words, Euclidean length, etc.
Weight = tf x idf / normalization
Web Mining: An Overview – p. 10/45
Text Mining: Searching
Measure vocabulary overlap between user query anddocuments
Sim(Q,D) =∑
i
wt(qi)× wt(di)
Use inverted list (index)
Termi → (Di1 , wi1), . . . , (Dik , wik)
Web Mining: An Overview – p. 11/45
Text Mining: Filtering
Aim: inform user about interesting new informatione.g. personalized news serviceMethod:1. User creates initial interest profile (= query)2. Each new document is compared to profile3. If similarity is "high enough", select and forward document
to user4. Refine query using user feedback5. New profile = α× old profile +
β
#rel.docs. ×∑
relevant docs -γ
#non-rel.docs. ×∑
non-relevant docs
6. Intuitively, add terms occurring in many relevantdocuments, remove terms occurring in many non-relevantdocuments
Web Mining: An Overview – p. 12/45
Text Mining: Word Relations
Motivation:Manual thesauri are:
general purpose (Roget’s Thesaurus, WordNet) – difficultto use for document retrievalretrieval-oriented (INSPEC, MeSH) – expensive to buildand maintain
Construct an automatic thesaurus (based on informationabout co-occurrence of words in a collection)
Web Mining: An Overview – p. 13/45
Text Mining: Word Relations
Association: if two terms co-occur within the sameparagraph, they constitute an association
〈term1, term2,assoc. frequency〉Gather data about term-associations over a large amount oftextRefine associations:
Discard associations with frequency 1Discard terms that are associated with too many otherterms (people, state, company, etc.)
Web Mining: An Overview – p. 14/45
Text Mining: Word Relations
Each term is represented by a vector of associated terms
T = (〈t1, w1〉, . . . , 〈tn, wn〉)⇒ term = pseudo documentCompare query to the term vectors (instead of documentvectors)
Sim(Q,T ) = Σiwt(qi)× wt(ti)Most “similar” terms are added to the queryExample: 1986 US Immigration Law
similar terms: illegal immigration, amnesty program,simpson-mazzoli
Web Mining: An Overview – p. 15/45
Text Mining: Word Relations
Experimental results:Data: 500,000 documents (news, computer abstracts, govt.documents); 50 queriesBaseline average precision: 37%Improves to 6 - 30% by using thesaurus2 weeks to generate association data!Processing time can be reduced without major loss inperformance by using a subset of the document collection
Web Mining: An Overview – p. 16/45
Text Mining: Classification
Users may prefer browsing through document collectioninstead of doing a direct keyword searchSearch sites organize web-pages into hierarchy of subjectcategoriese.g. Science > Physics > RelativityNew web-pages need to be inserted into appropriate classautomatically
Web Mining: An Overview – p. 17/45
Text Mining: Classification
Training: initially, documents are classified manuallyClass vector computed for each class based on thedocuments contained in that classe.g. DCS1
⇒algorithm, complexity, graphDCS2
⇒searching, text, algorithmDCS ⇒algorithm(2), complexity, ...
New document compared to class vectors at each level ofhierarchy to determine best fitExample:1. computeSim(D,CS), Sim(D,Physics), Sim(D,Maths), . . .
2. computeSim(D, relativity), Sim(D, optics), Sim(D,mechanics), . . .
Web Mining: An Overview – p. 18/45
Text Mining: Document Relationships
Creator of a web-page may not provide links to otherimportant related pagesLinks between related pages should be automaticallydiscovered
Related pages are expected to have a high similarityType of relationship should be detected if possible
summary-expansion, generalization-specialization, etc.
Web Mining: An Overview – p. 19/45
Text Mining: Document Relationships
Break each document into parasConstruct document relationship graph
nodes - parasedges - join paras that have a high similarity
Depending on patterns in the graph, likely relationship maybe detected
Web Mining: An Overview – p. 20/45
Document Relationships
Summary-Expansionnews-in-brief vs. news-in-detail
Web Mining: An Overview – p. 21/45
Document Relationships
Generalization-Specializationunmanned space missions vs. Pioneer 10
Web Mining: An Overview – p. 22/45
Text Mining: Summarization
Manual summarization method:read text and understand itextract salient pointswrite the summary
Automatic approximation:Break document into paragraphsCompute para-para similaritiesConstruct document relationship graphExtract “important” paras (expected to have high degree)
Bushy paras: paras connected to many other parasDepth first paras: from starting para, go to most similarpara
Comprehensiveness vs. coherencecomprehensive: covers salient pointscoherent: easy to read
Web Mining: An Overview – p. 23/45
Text Mining: Summarization
Evaluation:Manually extract best paras; compute overlap betweenautomatic and manual extractProblems:
agreement between humans is low (60%)just choosing first few paras works well
Web Mining: An Overview – p. 24/45
Structure Mining: Outline
Web as a graphDetecting hubs and authoritiesPage ranking (Google)Community detection
Web Mining: An Overview – p. 25/45
Structure Mining
Web is a directed graph: set of pages (nodes) connected byhyperlinks (edges)
Web Mining: An Overview – p. 26/45
Structure Mining
Based on about 200 million pages, 1.5 billion linksOne strongly connected component (path from each node toevery other node)IN – set of newly formed nodes with outgoing links into thecentreOUT – introvert nodes with only incoming links (e.g.corporate and e-commerce sites) from the centreTendrils and tubes (nodes in IN-tendrils connect to nodes inOUT-tendrils)Randomly chosen pair is connected only 24% of the time(average distance 16)
Web Mining: An Overview – p. 27/45
Structure Mining: HITS
Hyperlink Induced Topic SearchKleinberg, 1998
Identification ofauthorities – authoritative, high-quality web pages onbroad topicshubs – web pages that link to a collection of authorities
A good authority is pointed to by many good hubs
A good hub points to many good authorities
Inspired by the study of social networks and citation analysis
Web Mining: An Overview – p. 28/45
Structure Mining: HITS
Root set: given a broad query, collect the N highest rankedpages for the query from a text-based search engineExpanded set: add pages pointing to pages in root set, andpages pointed to by pages in root setIteratively update authority and hub scores
h(u) = a(v1) + a(v2) + a(v3)
u
v1
v2
v3
u1
u2
u3
v
a(v) = h(u1) + h(u2) + h(u3)
Web Mining: An Overview – p. 29/45
HITS: Problems, Solutions
Problems:Clique attacks (www.411fun.com, 411fashion.com,)etc.Mixed hubs and topic drifting
Solutions:make use of anchor text (the text surrounding a link) andboost weight of links which occur near instances of querytermseliminate outliers from the expanded setpartition mixed hubs into segments
Web Mining: An Overview – p. 30/45
Structure Mining: PageRank
Used in Google Search Engine’Global’ ranking of every web page calculated based onhyperlink structure of web (content ignored)Documents with matching keywords are returned in theglobal rank orderPrinciple: Highly linked pages are more important thanpages with a few linksA page has a high rank if the sum of the ranks of itsback-links is highMost effective for underspecified (general) queries
Web Mining: An Overview – p. 31/45
Structure Mining: Web Communities
Community: group of web pages sharing a common interestExplicit: Yahoo, Google, etc.Implicit: have to be discovered using content + hyperlinks
SimilarityMethod 1: A and B are related if one links to the otherMethod 2: A and B are related if
a number of pages contain links to both A and BA and B both link to a number of pages
Use matrix algebra methods (PCA, eigenvalue analysis), orgraph theoretic methods (community trawling)
Web Mining: An Overview – p. 32/45
Usage Mining: Outline
Premises and goalsCriteria for successArchitectureOpinion miningAmazon
Web Mining: An Overview – p. 33/45
Usage Mining: Data Sources
Server logsFor each access, web servers register an entry in a log filecontaining requesting host, user id, timestamp, pagerequested, browser type, referring page, etc.
Packet sniffer logsmonitor network traffic and extract usage data directlyfrom TCP/IP packets
User sessions/queries
User profiles, registration data, bookmarks
Web Mining: An Overview – p. 34/45
Usage Mining: Applications
Enhance server performance (caching, prefetching)Improve web site navigation (general / customized)Identify potential customers for e-commerceAdvertising:
identify potential prime advertisement locationstargeted advertising
Web Mining: An Overview – p. 35/45
Usage Mining: Desiderata
Rich datawide customer records with many potentially useful fieldsallow data mining algorithms to search beyond obviouscorrelationsrecording the actions of customers in the virtual store ismuch easier(items examined, selected, purchased)
Large volumes of datarequired to train reliable (complex) models
Controlled/reliable data collectionmanual data entry / integration from legacy systemsavoidable
Evaluation of return on investmentEase of integration
Web Mining: An Overview – p. 36/45
Usage Mining: Integrated Architecture
Customer Interfacedata collector needs to beintegrated into interfacesale transactions + other details(redirection, promotion,personalization, etc.) Analysis
CustomerInterface
BusinessData
DefinitionDeployresults
Data warehouse
Stagedata
Business Data Definitionmerchandise-related information (products, price, etc.)content information (web page templates, articles, images,multimedia)business rules (promotions / personalization / cross-sellingrules)
!! important to have rich set of metadata attributes
Web Mining: An Overview – p. 37/45
Usage Mining: Data Collection
Server/packet sniffer logs: non-intrusive but very low-level
Problems:User identification
user may use different machines/browsersuse of public access PCs / proxy servers / caching
⇒ use login ids, cookies, “negative” expiration dates, etc.Session identification: HTTP is “stateless”
login ids – painful to register at sites⇒ use timeouts (30 minutes)
Intra-page navigationData generated by CGI scripts, dynamic pagesEncryption / secure pages (for packet sniffers)
Web Mining: An Overview – p. 38/45
Usage Mining: Data Collection
Application server logs: can collect high-level informationapplication server has detailed knowledge of content sentto userserver can use cookies or URL encoding to keep track ofsessions, events, user identities
High-level information can be used to calculate:micro-conversion rates: for each step of the purchasingprocess, the fraction of products that are successfullycarried through to the next step of the purchasing process
view→ add to cart→ checkouteffectiveness of personalization: correlation betweenusing a personalization rule and shopping cart/checkoutevents compared to using control groups
Web Mining: An Overview – p. 39/45
Usage Mining: Analysis
Aggregation: needed to convert collected data into forms thatare more amenable to analysis
examples:how much money does a customer spend on books?what is the frequency of a customer’s purchases?what kind of shipping options are chosen what portion ofthe time?
may not be easy to achieve using standard aggregationtools provided by SQL, etc.
Transformation of dates:difference between order date and ship dateextract day of the week, month, quarter, season, etc.
Web Mining: An Overview – p. 40/45
Usage Mining: Analysis
Basic reporting:what are the top/worst selling products?what are the top successful / failed searches?who are the top referrers by visit count / sales amount? (*)what are the top abandoned products? (*)what is the distribution of web browsers?
Visualization toolsAssociation rule miningClassifiers (Bayesian, decision tree, etc.)Interactive model interpretation / modification tools are amust
Web Mining: An Overview – p. 41/45
Opinion Mining
Feature-based opinion summarizationIdentify the features of the product that customers haveexpressed opinions on (called opinion features)For each feature, identify how many customer reviews arepositive / negative
Examples:The pictures are very clear.
Overall a fantastic, very compact, camera.
While light, it will not easily fit in pockets. (HARD!)
Web Mining: An Overview – p. 42/45
Opinion Mining
Feature identification1. POS tagging + chunking: identify nouns, verbs, adjectives,
simple noun groups, verb groups2. Transaction creation for each sentence: item ≡ normalized
nouns / noun phrases3. Association rule mining: all itemsets with > 1% support are
candidate frequent features4. Feature pruning:
keep features that have some compact occurrenceskeep singleton itemsets only if they occur enough times inisolatione.g. manual vs. manual mode, manual setting
5. Infrequent feature identification: noun/noun phrase thatoccurs closest to a known opinion word
Web Mining: An Overview – p. 43/45
Opinion Mining
Sentiment / orientation identification1. Examine each sentence in the review database2. If it contains a frequent feature, extract all the adjective words
as opinion words3. For each feature in the sentence, the nearby adjective is
recorded as its effective opinion4. Look up adjective in a list of adjectives with known
orientation, or consult WordNet (discard unknowns)adjectives arranged in bipolar structures
Web Mining: An Overview – p. 44/45
44-1
44-2
44-3
44-4
44-5
References
Web Mining Research: A Survey, R. Kosala, H. Blockeel,SIGKDD Explorations, 2(1), July, 2000.Web Usage Mining: Discovery and Applications of UsagePatterns from Web Data, J. Srivastava, R. Cooley, M.Deshpande, P. Tan, SIGKDD Explorations, 1(2), Jan, 2000.WEBKDD 2000 Worskhop on Web Mining for E-Commerce –Challenges and Opportunities.http://robotics.stanford.edu/~ronnyk/WEBKDD2000/papers/
Data Mining and Knowledge Discovery, 5 (2001), 6 (2002).Communications of the ACM, 45(8), August 2002.Machine Learning, 57, 2004.14th International World Wide Web Conference (WWW2005) Tutorial on Web Content Mining (Bing Liu)http://www.cs.uic.edu/~liub
Web Mining: An Overview – p. 45/45