Web Information Retrieval and Mining
Web Retrieval and Mining Overview
Source: Ricardo Baeza-Yates and Carlos Castillo: Web Retrieval and
Mining. Entry in Encyclopedia of Library and Information Sciences,
third edition (to appear in 2009).
Information Retrieval
Methods for finding information in documents
Started in the 1970s and 1980s
Methods
Algorithms and heuristics
Finding
Query-to-document, document-to-document, etc.
Documents
Texts
The Web is different
Massive
Thousands of millions of documents
Dynamic
Updates
Deletes
Distributed
Variable quality
Malicious behavior
Web IR topics
Web Search
Crawling
Indexing
Querying
Web Mining
Adversarial Web IR
Distributed Web IR
Evaluation
Web search
Main goals
Precision
Relevant documents returned / Documents returned
Recall
Relevant documents returned / Relevant documents
Freshness
Performance/scalability
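The two ratios above (precision and recall) can be sketched for a single query as follows. This is a minimal illustration; the function name and the document identifiers are made up for the example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query, given two collections of doc ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
p, r = precision_recall(returned=["d1", "d2", "d3", "d4"],
                        relevant=["d2", "d4", "d7"])
print(p, r)
```

Here two of the four returned documents are relevant (precision 2/4) and two of the three relevant documents were found (recall 2/3).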
Two phases of search
Off-line
Crawling and indexing
On-line
Querying and ranking
Search phases
Web crawling
Download pages following rules
Applications
Create index for search
Find particular information items
Find/report problems
Constraints
Robot exclusion protocol and politeness
Deep web
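As a rough sketch of the robot exclusion constraint, Python's standard `urllib.robotparser` module can decide whether a URL may be fetched. The robots.txt content, the bot name "ExampleBot", and the URLs are assumptions for illustration; a real crawler would download the file from the site and also rate-limit its requests.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)
allowed = policy.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = policy.can_fetch("ExampleBot", "http://example.com/private/data.html")
delay = policy.crawl_delay("ExampleBot")  # seconds to wait between requests
print(allowed, blocked, delay)
```

A polite crawler would skip the disallowed URL and sleep for `delay` seconds between requests to the same host.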
Web indexing
Logical view
Tokenization
Stopword removal
Stemming
Creation of an inverted index
Inverted index
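The indexing steps listed above (tokenization, stopword removal, stemming, inverted index creation) could look roughly like this. The stopword list is a tiny illustrative one, and `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's; neither comes from the source.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stemmed term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token not in STOPWORDS:
                index[stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {1: "Crawling the Web", 2: "Indexing Web pages", 3: "Ranking of pages"}
index = build_index(docs)
print(index["web"], index["page"])
```

Each entry of `index` is a posting list: the documents in which the term occurs. A production index would also store term frequencies and positions.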
Challenges of indexing
Index compression
Efficiency in top-K searches
Sorting
Index distribution
By terms
By documents
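One common approach to the index compression challenge listed above is to store the gaps between consecutive document ids and encode each gap with a variable number of bytes. The sketch below is one possible byte layout (low-order bits first, high bit marking the last byte of a number), chosen for the example and not necessarily the scheme the source has in mind.

```python
from itertools import accumulate


def to_gaps(postings):
    """Turn a sorted posting list into its first id followed by differences."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def from_gaps(gaps):
    """Inverse of to_gaps: a running sum restores the document ids."""
    return list(accumulate(gaps))


def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # high bit clear: more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit set: last byte of this number
    return bytes(out)


def vbyte_decode(data):
    """Decode bytes produced by vbyte_encode back into a list of ints."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


postings = [824, 829, 215406]  # hypothetical document ids
encoded = vbyte_encode(to_gaps(postings))
decoded = from_gaps(vbyte_decode(encoded))
print(len(encoded), decoded)
```

Small gaps fit in a single byte, so long posting lists of frequent terms, where consecutive ids are close together, shrink the most.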
Web querying and ranking
Keyword-based search is the dominant paradigm
No large-scale open-domain QA systems (yet)
Relevance
Vector space model and variants
Query expansion
Latent semantic indexing
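A minimal sketch of relevance ranking in the vector space model, assuming one common variant: TF-IDF weights with cosine similarity. The toy documents, the whitespace tokenizer, and the function names are assumptions for the example; `heapq.nlargest` keeps only the top-k scores without sorting everything.

```python
import heapq
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (per-document TF-IDF vectors, idf table) for a dict of texts."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def search(query, docs, k=2):
    """Return the ids of the k best-matching documents, best first."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    q = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
    scored = ((cosine(q, vec), d) for d, vec in vectors.items())
    return [d for score, d in heapq.nlargest(k, scored) if score > 0]


docs = {
    "d1": "web search engines crawl the web",
    "d2": "link analysis ranks web pages",
    "d3": "cooking pasta at home",
}
results = search("web search", docs)
print(results)
```

Recomputing the vectors on every query is only acceptable for a toy collection; a search engine would precompute them off-line in the index.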
Web ranking
Quality is the main problem
Link ranking
Hypothesis 1: Topical locality of links
Hypothesis 2: Link implies endorsement
PageRank
HITS
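Both link-ranking algorithms named above can be sketched with simple iteration over a small graph. This is a simplified illustration, not the production form of either algorithm: the graph is a hypothetical dict mapping every page to the pages it links to (every link target must also appear as a key), and the damping factor 0.85 and the iteration count are conventional assumptions.

```python
import math


def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; returns a dict of scores that sum to 1."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A page without out-links spreads its rank over all pages.
            targets = graph[v] or nodes
            share = damping * rank[v] / len(targets)
            for w in targets:
                new[w] += share
        rank = new
    return rank


def hits(graph, iterations=50):
    """Return (hub, authority) scores, each normalized to unit length."""
    nodes = list(graph)
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {v: 0.0 for v in nodes}
        for v in nodes:
            for w in graph[v]:
                auth[w] += hub[v]
        # Hub: sum of authority scores of the pages linked to.
        hub = {v: sum(auth[w] for w in graph[v]) for v in nodes}
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(graph)
hub, auth = hits(graph)
print(max(rank, key=rank.get), max(auth, key=auth.get))
```

In this toy graph page "c" receives links from three pages, so it gets both the highest PageRank and the highest authority score.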
Rank manipulation
The bubble of Web visibility
Content spam
Keyword stuffing
Content hiding
Link spam
Link farms
Cloaking
Web mining
Content mining
Extraction of knowledge from Web pages
BUT ... HTML is physical formatting
There is information loss
Information loss
Aspects of content mining
Information extraction
Reverse the information loss
Content classification
Topic
Genre
Sentiment analysis
Link mining
Scale-free networks
Macroscopic view
Bow-tie structure
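The bow-tie structure can be illustrated by splitting a directed graph into a strongly connected core, an IN part that can reach the core, and an OUT part reachable from it. The sketch below assumes a hypothetical `seed` page known to lie in the core, which avoids a full strongly-connected-components computation; the toy graph and the names are made up for the example.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: the set of nodes reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def reverse(graph):
    """Return the graph with every link direction flipped."""
    rev = {v: [] for v in graph}
    for v, targets in graph.items():
        for w in targets:
            rev.setdefault(w, []).append(v)
    return rev


def bow_tie(graph, seed):
    """Classify nodes relative to the core that contains seed."""
    rev = reverse(graph)
    forward = reachable(graph, seed)   # seed can reach these pages
    backward = reachable(rev, seed)    # these pages can reach seed
    core = forward & backward
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": set(rev) - forward - backward,
    }


graph = {"i": ["a"], "a": ["b"], "b": ["a", "o"], "o": [], "x": []}
parts = bow_tie(graph, seed="a")
print(parts)
```

Pages "a" and "b" link to each other and form the core; "i" only links in, "o" is only linked to, and "x" is disconnected.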
Usage mining
Logfile analysis
Query logs
Privacy issues
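A very small sketch of query-log analysis: counting the most frequent queries after light normalization. The log format (timestamp, tab, query) and the sample lines are hypothetical. No user identifier is kept here, in keeping with the privacy concern noted above.

```python
from collections import Counter

# Hypothetical log lines in "timestamp<TAB>query" format.
LOG_LINES = [
    "2009-01-05T10:00:01\tWeb Mining",
    "2009-01-05T10:00:07\tpagerank",
    "2009-01-05T10:01:13\tweb  mining",
    "2009-01-05T10:02:40\tinverted index",
]


def top_queries(lines, k=3):
    """Return the k most frequent queries as (query, count) pairs."""
    counts = Counter()
    for line in lines:
        _, _, query = line.partition("\t")
        normalized = " ".join(query.lower().split())  # fold case and spacing
        if normalized:
            counts[normalized] += 1
    return counts.most_common(k)


top = top_queries(LOG_LINES, k=1)
print(top)
```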
Emerging topics
Mobile Web
Semantic Web
...