Web Information Retrieval and Mining

Upload: carlos-castillo-chato

Post on 16-Apr-2017

Web Retrieval and Mining Overview

Source: Ricardo Baeza-Yates and Carlos Castillo: Web Retrieval and Mining. Entry in Encyclopedia of Library and Information Sciences, third edition (to appear in 2009).

Information Retrieval

Methods for finding information in documents

Started in the 1970s and 1980s

Methods

Algorithms and heuristics

Finding

Query-to-document, document-to-document, etc.

Documents

Texts

The Web is different

Massive

Thousands of millions of documents

Dynamic

Updates

Deletes

Distributed

Variable quality

Malicious behavior

Web IR topics

Web Search

Crawling

Indexing

Querying

Web Mining

Adversarial Web IR

Distributed Web IR

Evaluation

Web search

Main goals

Precision

Relevant documents returned / Documents returned

Recall

Relevant documents returned / Relevant documents

Freshness

Performance/scalability
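
The precision and recall ratios defined above can be computed directly from two sets of document identifiers. The following is a minimal sketch; the function name and the sample identifiers are illustrative assumptions, not part of the original slides.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant documents returned / documents returned
    recall    = relevant documents returned / relevant documents
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for one query.
returned = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Returning more documents can only keep recall the same or raise it, but it usually lowers precision, which is why the two are reported together.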

Two phases of search

Off-line

Crawling and indexing

On-line

Querying and ranking

Search phases

Web crawling

Download pages following rules

Applications

Create index for search

Find particular information items

Find/report problems

Constraints

Robot exclusion protocol and politeness

Deep web
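
The robot exclusion constraint mentioned above can be checked with Python's standard `urllib.robotparser` module. This sketch parses an in-memory robots.txt so that it runs without network access; the rules, the `ExampleBot` agent name, and the example.com URLs are hypothetical.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would download it
# from the site's /robots.txt before fetching any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


AGENT = "ExampleBot"  # hypothetical crawler name
policy = make_policy(ROBOTS_TXT)

for url in ("https://example.com/index.html",
            "https://example.com/private/report.html"):
    print(url, "allowed" if policy.can_fetch(AGENT, url) else "disallowed")

# Politeness: wait at least this many seconds between requests to the host.
print("crawl delay:", policy.crawl_delay(AGENT))
```

Exclusion rules only say what may be fetched. Politeness, meaning the spacing of requests to a single host, is the crawler's own responsibility even when no `Crawl-delay` is given.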

Web indexing

Logical view

Tokenization

Stopword removal

Stemming

Creation of an inverted index
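
The indexing steps above (tokenization, stopword removal, stemming, inverted index creation) can be sketched in a few lines. The stopword list and the suffix-stripping `stem` function are toy assumptions for illustration; a production system would use a full stopword list and an established stemmer such as Porter's.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Toy stemmer: strip one common suffix if at least 3 characters remain."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {stem(t) for t in tokenize(docs[doc_id]) if t not in STOPWORDS}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)


# Hypothetical three-document collection.
docs = {
    1: "Crawling the Web",
    2: "Indexing and searching the Web",
    3: "Web crawlers index documents",
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Each posting list is kept in increasing document-id order, which makes intersecting lists for multi-term queries a simple merge.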

Inverted index

Challenges of indexing

Index compression

Efficiency in top-K searches

Sorting

Index distribution

By terms

By documents
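
Index compression, the first challenge listed above, is commonly approached by storing the gaps between sorted document ids and encoding each gap with a variable number of bytes. The slides do not say which scheme is meant, so the following is only a sketch of one well-known option (gap encoding plus variable-byte codes); the byte layout chosen here is an assumption.

```python
def to_gaps(doc_ids):
    """Replace each id in an ascending list by its distance from the previous id."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by accumulating the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first.

    The high bit is set only on the last byte of each number.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode bytes produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # last byte of the current number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


postings = [3, 7, 8, 150, 100000]  # hypothetical posting list
encoded = vbyte_encode(to_gaps(postings))
assert from_gaps(vbyte_decode(encoded)) == postings
print(len(encoded), "bytes instead of", 4 * len(postings), "with fixed 32-bit ids")
```

Gaps are small for frequent terms, so most of them fit in a single byte, and frequent terms are exactly the ones with the longest posting lists.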

Web querying and ranking

Keyword-based search is the dominant paradigm

No large-scale open-domain QA systems (yet)

Relevance

Vector space model and variants

Query expansion

Latent semantic indexing
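
As an illustration of the vector space model named above, documents and the query can be represented as TF-IDF weighted term vectors and ranked by cosine similarity. This is one common variant among many; the weighting formula (raw term frequency times log(N/df)), the whitespace tokenizer, and the sample documents are assumptions made for the sketch.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return ({doc_id: {term: weight}}, {term: idf}) for a small collection."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document ids ordered by decreasing similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scores = {d: cosine(q, vec) for d, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical collection and query.
docs = {
    "d1": "web search engines crawl the web",
    "d2": "inverted index compression",
    "d3": "link analysis for web ranking",
}
print(rank("web crawl", docs))
```

On the Web this content-based score is only one signal. It is typically combined with link-based and other quality signals.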

Web ranking

Quality is the main problem

Link ranking

Hypothesis 1: Topical locality of links

Hypothesis 2: Link implies endorsement

PageRank

HITS
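
Of the two link-ranking algorithms named above, PageRank is the simpler one to sketch: each page repeatedly shares its score among the pages it links to, which encodes the "link implies endorsement" hypothesis. The code below is a minimal power-iteration version; the damping factor of 0.85, the fixed iteration count, the uniform treatment of pages without out-links, and the four-page graph are assumptions for illustration. HITS is not shown.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [linked pages]}."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Score held by pages without out-links is spread over all pages.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in graph.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page `c` receives links from three pages and ends up with the highest score, while `d`, which nobody links to, keeps only the baseline share.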

Rank manipulation

The bubble of Web visibility

Content spam

Keyword stuffing

Content hiding

Link spam

Link farms

Cloaking
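
To make the keyword stuffing item above concrete, one very simple signal is the share of a page's tokens taken up by its single most frequent term. This heuristic and its threshold of 0.3 are illustrative assumptions, not a method from the source; real spam detectors combine many content and link features.

```python
import re
from collections import Counter


def top_term_share(text):
    """Fraction of tokens accounted for by the most frequent term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    _, count = Counter(tokens).most_common(1)[0]
    return count / len(tokens)


def looks_stuffed(text, threshold=0.3):
    """Flag text whose most frequent term exceeds the (assumed) threshold."""
    return top_term_share(text) > threshold


# Hypothetical page texts.
stuffed = "cheap flights cheap flights cheap cheap deals cheap"
normal = "the web is a large distributed collection of documents"
print(looks_stuffed(stuffed), looks_stuffed(normal))
```

A threshold like this is easy to evade and misfires on very short pages, which is part of why adversarial Web IR is treated as a topic of its own.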

Web mining

Content mining

Extraction of knowledge from Web pages

BUT ... HTML is physical formatting

There is information loss

Information loss

Aspects of content mining

Information extraction

Reverse the information loss

Content classification

Topic

Genre

Sentiment analysis
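
Because HTML mostly records physical formatting, a usual first step before information extraction or classification is to separate the visible text from the markup. The sketch below does this with the standard `html.parser` module; the class name, the choice to skip `script` and `style` content, and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the price is marked up as bold, not as "a price".
page = ("<html><head><style>b{color:red}</style></head><body>"
        "<h1>Price list</h1><p>Laptop: <b>900</b> EUR</p>"
        "<script>var x=1;</script></body></html>")
print(extract_text(page))
```

The output keeps the words but not the fact that 900 is a price in euros for a laptop. Recovering that kind of structure is the job of information extraction.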

Link mining

Scale-free networks

Macroscopic view

Bow-tie structure
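
The bow-tie structure above splits the Web graph into a strongly connected core, an IN part that can reach the core, an OUT part reachable from the core, and the rest. A minimal sketch follows, assuming the caller supplies a seed page known to lie in the core; the function names and the small graph are hypothetical.

```python
from collections import deque


def reachable(start, adjacency):
    """Return the set of nodes reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bow_tie(graph, seed):
    """Classify nodes relative to the strongly connected component of seed."""
    nodes = set(graph)
    reverse = {}
    for u, targets in graph.items():
        for v in targets:
            nodes.add(v)
            reverse.setdefault(v, []).append(u)
    forward = reachable(seed, graph)     # pages the seed can reach
    backward = reachable(seed, reverse)  # pages that can reach the seed
    core = forward & backward
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": nodes - forward - backward,
    }


# Hypothetical graph: i1 -> core (c1, c2, c3) -> o1, plus a detached pair.
graph = {"i1": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c1", "o1"],
         "o1": [], "x": ["y"]}
parts = bow_tie(graph, "c1")
for name in ("core", "in", "out", "other"):
    print(name, sorted(parts[name]))
```

Two breadth-first searches are enough once a core page is known. Finding the largest strongly connected component without a seed needs a full SCC algorithm such as Tarjan's.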

Usage mining

Logfile analysis

Query logs

Privacy issues
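
A basic form of query log analysis is counting the most frequent queries. The sketch below assumes a hypothetical tab-separated log format (timestamp, user id, query) and invented sample lines. It ignores the user id entirely, which is one simple way to reduce, though not eliminate, the privacy issues noted above.

```python
from collections import Counter

# Hypothetical log lines: timestamp <TAB> user id <TAB> query
LOG_LINES = [
    "2009-01-05T10:00:01\tu1\tweb mining",
    "2009-01-05T10:00:07\tu2\tPageRank",
    "2009-01-05T10:01:12\tu3\tWeb  Mining",
    "2009-01-05T10:02:30\tu1\tinverted index",
    "2009-01-05T10:03:44\tu4\tpagerank",
    "2009-01-05T10:05:02\tu2\tweb mining",
    "malformed line without tabs",
]


def top_queries(lines, k=3):
    """Return the k most frequent normalized queries as (query, count) pairs."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        query = " ".join(parts[2].lower().split())  # normalize case and spaces
        counts[query] += 1
    return counts.most_common(k)


for query, count in top_queries(LOG_LINES):
    print(count, query)
```

Query text alone can still identify a person, so aggregated counts are safer to share than raw log lines.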

Emerging topics

Mobile Web

Semantic Web

...
