Pattern Reduction and Information Retrieval


Department of Electrical Engineering, National Cheng Kung University

Chu-Sing Yang (楊竹星)

http://itlab.ee.ncku.edu.tw/

Outline

• Pattern Reduction
• Information Retrieval
• Conclusion


Combinatorial Optimization Problem

Complex problems: NP-complete problems (time)
• No optimum solution can be found in a reasonable time with limited computing resources.
• E.g., Traveling Salesman Problem

Large-scale problems (space)
• In general, this kind of problem cannot be handled efficiently with limited memory space.
• E.g., data clustering problem, astronomy, MRI


Combinatorial Optimization Problem and Metaheuristics

Traveling Salesman Problem (n!)

Shortest Routing Path

Path 1:

Path 2:
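The n! on this slide can be made concrete with an exhaustive search. The following is a minimal sketch, not a method from the slides: the four city coordinates are hypothetical, and the function names are chosen only for illustration.

```python
from itertools import permutations
from math import dist, factorial

def tour_length(points, order):
    """Length of the closed tour that visits the points in the given order."""
    return sum(dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def brute_force_tsp(points):
    """Try every ordering of the cities and keep the shortest closed tour."""
    n = len(points)
    best = min(permutations(range(n)), key=lambda order: tour_length(points, order))
    return best, tour_length(points, best)

# Hypothetical example: four cities on the corners of a unit square.
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
best_order, best_len = brute_force_tsp(cities)
print(best_order, best_len)   # the perimeter tour of the square
print(factorial(20))          # orderings to check for only 20 cities
```

Checking every ordering is fine for four cities, but the count grows factorially, which is why metaheuristics are used instead.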


An example: Bulls and cows

Check all candidate solutions

Secret number: 9305

Guess, feedback, deduction:
• Opponent's try: 1234, feedback: 0A1B
• Opponent's try: 5678, feedback: 0A1B
• Deduction: the digits 0 and 9 must be in the secret number

(Example from Wikipedia)
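The deduction above can be checked by enumerating all candidates and discarding those that contradict the feedback. This is a minimal sketch; it assumes four distinct digits with a leading zero allowed, which the slide does not state.

```python
from itertools import permutations

def score(secret, guess):
    """Return (A, B): A = right digit in the right place, B = right digit in the wrong place."""
    a = sum(s == g for s, g in zip(secret, guess))
    b = len(set(secret) & set(guess)) - a
    return a, b

# All 4-digit candidates with distinct digits (leading zero allowed: an assumption).
candidates = ["".join(p) for p in permutations("0123456789", 4)]

# Keep only the candidates consistent with both observed feedbacks (0A1B each).
feedback = [("1234", (0, 1)), ("5678", (0, 1))]
remaining = [c for c in candidates
             if all(score(c, guess) == fb for guess, fb in feedback)]

print(score("9305", "1234"))
print(len(candidates), len(remaining))
print(all("0" in c and "9" in c for c in remaining))
```

Each guess shares exactly one digit with the secret, so only two of the digits 1 to 8 are used and the other two must be 0 and 9.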

[Figure: the transition, evaluation, and determination operators, shown twice in sequence.]


Concept (1/4)

Our observation shows that a lot of computations of most, if not all, of the metaheuristic algorithms during their convergence process are redundant.


(Data courtesy of Su and Chang)


Concept (2/4)



Concept (3/4)



Concept (4/4)

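[Figure: solutions C1 and C2 at g = 1, 2, ..., n, without and with pattern reduction]

g    Metaheuristics                       Metaheuristics + Pattern Reduction
1    C1 = 0 0 1 0, C2 = 1 1 1 0, s = 4    C1 = 0 0 1 0, C2 = 1 1 1 0, s = 4
2    C1 = 1 0 1 0, C2 = 0 1 1 0, s = 4    C1 = 1 0, C2 = 0 1, s = 2
...
n    C1 = 0 1 1 0, C2 = 1 1 1 0, s = 4    C1 = 0 1, C2 = 1 1, s = 2

With pattern reduction, s drops from 4 to 2 once the common trailing bits 1 0 are compressed.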


The Proposed Algorithm

Create the initial solutions P = {p1, p2, . . . , pn}

While the termination criterion is not met

    Apply the transition, evaluation, and determination operators of the metaheuristic in question to P

    /* Begin PR */

    Detect the sub-solutions R = {r1, r2, . . . , rm} that have a high probability of not being changed

    Compress the sub-solutions in R into a single pattern, say, c

    Remove the sub-solutions in R from P; that is, P = P \ R

    P = P ∪ {c}

    /* End PR */

End
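The loop above can be sketched on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the slides: the bit-counting objective, the mutation-only metaheuristic, the `patience` threshold, and the choice to keep the compressed pattern c in a separate dictionary (`frozen`) rather than inside P.

```python
import random

def fitness(bits):
    """Toy objective (an assumption for this sketch): count the 1 bits."""
    return sum(bits)

def evolve(pop, rng):
    """One generation: transition (mutation), evaluation, determination (keep the better)."""
    new_pop = []
    for ind in pop:
        child = [b ^ (rng.random() < 0.1) for b in ind]   # transition
        new_pop.append(max(ind, child, key=fitness))       # evaluation + determination
    return new_pop

def pattern_reduction(pop, loci, frozen, streak, patience):
    """Compress loci on which all individuals have agreed for `patience` generations."""
    keep = []
    for col, locus in enumerate(loci):
        values = {ind[col] for ind in pop}
        streak[locus] = streak[locus] + 1 if len(values) == 1 else 0
        if streak[locus] >= patience:
            frozen[locus] = pop[0][col]    # becomes part of the compressed pattern c
        else:
            keep.append(col)
    new_pop = [[ind[c] for c in keep] for ind in pop]
    return new_pop, [loci[c] for c in keep]

def run(n_bits=16, pop_size=6, generations=60, patience=3, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    loci = list(range(n_bits))             # original positions still being searched
    frozen = {}                            # compressed pattern: locus -> fixed value
    streak = {i: 0 for i in range(n_bits)}
    work = 0                               # bits touched by the operators
    for _ in range(generations):
        if not loci:
            break
        pop = evolve(pop, rng)
        work += len(pop) * len(loci)
        pop, loci = pattern_reduction(pop, loci, frozen, streak, patience)
    return frozen, loci, work

frozen, loci, work = run()
print(len(frozen), "loci compressed;", len(loci), "still searched;", work, "bits processed")
```

Because compressed loci are no longer passed to the operators, the work per generation shrinks as the search converges. A locus may also be frozen at a poor value, which is the price of the heuristic.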



Detection

Time-oriented
• Detect patterns that have not changed in a certain number of iterations (aka static patterns).

Space-oriented
• Detect sub-solutions that are common at certain loci.

Problem-specific
• E.g., for k-means, we assume that patterns near a centroid are unlikely to be reassigned to another cluster.


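[Figure: two detection examples, from iteration T1 to Tn]

Common sub-solution across two solutions:
• T1: P1: 1352476, P2: 7352614
• Tn: P1: 1 C1 476, P2: 7 C1 614

Static sub-solution of one solution over time:
• T1: 1352476
• T2: 7352614
• T3: 7352416
• …
• Tn: 7 C1 416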
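The two-solution example (P1 and P2) can be reproduced with a short sketch. It assumes the common loci form one contiguous run, as they do in the slide's example; the function names are hypothetical.

```python
def common_loci(solutions):
    """Return the positions at which every solution holds the same symbol."""
    return [i for i, column in enumerate(zip(*solutions)) if len(set(column)) == 1]

def compress(solution, loci, label="C1"):
    """Replace one contiguous run of common loci with a single pattern label."""
    start, end = loci[0], loci[-1] + 1
    return solution[:start] + [label] + solution[end:]

p1, p2 = list("1352476"), list("7352614")
loci = common_loci([p1, p2])
print(loci)
print(compress(p1, loci), compress(p2, loci))
```

The shared run 352 is replaced by the single symbol C1, so later operators handle five symbols per solution instead of seven.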


An Example


Time Complexity

Ideally, the running time of “k-means with PR” is independent of the number of iterations.

In reality, however, our experimental results show that setting the removal bound to 80% gives the best result.

where n is the number of patterns, k the number of clusters, l the number of iterations, and d the number of dimensions.
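The ideal case can be illustrated with a worked bound, under an assumption that is not on the slide. Standard k-means costs O(nkd) per iteration, hence O(nkld) in total. Suppose pattern reduction removed a fixed fraction r (0 < r ≤ 1) of the remaining patterns after every iteration. The fraction r is hypothetical and is not the same thing as the 80% removal bound. The total work is then a geometric sum that no longer depends on l:

```latex
\[
T_{\text{k-means}} = O(nkld), \qquad
T_{\text{PR}} = \sum_{i=0}^{l-1} n(1-r)^{i}\,k\,d
  \;\le\; \frac{nkd}{r} = O(nkd) \quad \text{for fixed } r .
\]
```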



Outline

• Pattern Reduction
• Information Retrieval
• Conclusion


Introduction

Over the past decade, computers have transformed traditional printed material into digital material.

Internet technology makes most information and knowledge searchable and usable by anyone.

Acquiring knowledge is no longer limited by geography, as a search engine can be shared and used by anyone, anywhere, anytime, using any internet browsing software.

[Figure: Printed Material, Digital Material, Database, Web Pages]


The history

[Figure: timeline with marks at 1960, 1980, and 1990, from printed material to digital material to web pages. Other labels: Internet, Data Analysis and Mining, Web Information Extraction, Web Information Retrieval, Web Mining, Application.]


Problem

Due to the growth of online information, a large number of files have flooded the internet.

We can easily get the information that we need, but we spend too much time seeking out what is relevant.

[Figure: the Internet compared with a library]


Problem (cont.)

Users cannot cope with the large amount of information on the internet.


Web IR

Information Retrieval (IR)
• The goal of IR is to retrieve documents whose content is relevant to the user’s need and to find the relationships between the documents.

Rieh and Xie pointed out that information retrieval is an interactive and iterative process.

A collection of documents is a set of documents which is related to a specific context of interest.

Research on information retrieval covers a very broad area including the dependence analysis of a group of files, the clustering of files and the classification of files.


IR and Web IR

Traditional IR to Web IR

New information sources:
• Digital data → web pages, databases, the Internet, etc.

New media types:
• Text → HTML, image, video, audio.

New applications:
• File or data → web search, video search, audio search.

However, the major difference between classic information retrieval (CIR) and web information retrieval (Web IR) is that they face different data sets.


A taxonomy of information retrieval models

User Task
• Retrieval: Ad hoc, Filtering
• Browsing

Classic Models
• Boolean
• Vector
• Probabilistic

Set Theoretic
• Fuzzy
• Extended Boolean

Algebraic
• Generalized Vector
• Latent Semantic Index
• Neural Networks

Probabilistic
• Inference Network
• Belief Network

Structured Models
• Non-Overlapping Lists
• Proximal Nodes

Browsing
• Flat
• Structure Guided
• Hypertext

(R. Baeza-Yates)


Document Similarity

Vector Space Model (VSM)
• G. Salton and M.E. Lesk, 1968
• The cosine of θ is the similarity between document j and the query q.
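The cosine measure can be sketched in a few lines. This example assumes raw term frequencies as vector weights and whitespace tokenization; the slide does not say which weighting is used, and the sample texts are hypothetical.

```python
from collections import Counter
from math import sqrt

def tf_vector(text):
    """Term-frequency vector of a text, using lowercase whitespace tokens."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

query = tf_vector("pattern reduction")
doc_a = tf_vector("pattern reduction for clustering")
doc_b = tf_vector("web search engine")
print(cosine(doc_a, query), cosine(doc_b, query))
```

A document that shares no term with the query scores 0, and a document identical to the query scores 1.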


Web Information Extraction

Information Extraction (IE)
• Wrapper (human): given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.
• IE (automatic):
  • Given a set of positive pages, generate extraction patterns.
  • Given only a single page with multiple data records, generate extraction patterns.

Assumption: the data have a structure or a schema.

Integrate the data present in different web sites.


An example of IE

[Figure: a web page divided into several information blocks, each labeled IB.]

IB: Information Block


Web Mining

Data mining extracts knowledge from data, whereas web mining extracts information from the World Wide Web.

Web mining is broadly defined as the discovery and analysis of useful information from the Web.

Web mining can be divided into:
• Web usage mining
• Web content mining
• Web structure mining


Web Mining Process

1. Web page
2. Wrapper (extract rules)
3. Patterns
4. Mining
5. Information or knowledge


Web Structure Mining

Generates a structural summary of the web site and its web pages.

Tries to discover the link structure of the hyperlinks in the web pages.

Reveals more information than the content of the web pages alone.


Web structure mining – PageRank (Google)

[Figure: PageRank example, after S. Brin. Values shown: 30, 30, 30, 3, 3, 3; 30 + 3 = 33; 30 + 3 + 3 = 36; 33/4 = 8.25; 36/2 = 18; 9.]
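The idea of passing rank along links can be sketched with a small power iteration. This is the commonly described damped formulation, not necessarily the simplified arithmetic in the figure. The four-page graph is hypothetical, and the damping value 0.85 is a commonly used default rather than something stated on the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages      # a page without out-links spreads evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web; every link target is also a key of the dict.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))
print(round(sum(ranks.values()), 6))
```

Each page splits its rank evenly over its out-links, so a page linked from many or highly ranked pages ends up with a high rank itself.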


CSES – A Cluster Search Engine

Motivation

Search Engine Operations

Other Types of Search Engines
• Meta-search Engine
• Clustering Search Engine

CSES: A Clustering Search Engine
• System Framework
• Clustering Algorithm
• User Interface


Motivation

A search engine is an information retrieval system designed to help users find information on a computer.

Problem: a query can be ambiguous. For example, if “mp3” is given to a search engine, it could mean an “mp3 music file” or an “mp3 player.” Another example: when the keyword “cat” (meaning the animal) is given as a query to the Google search engine, the first item returned is the company “Caterpillar, Inc.,” which has nothing to do with the animal.


Search Engine Operations

Web crawling (Web spider, Web robot)
• An automated Web browser that follows every link it sees.

Indexing
• The contents of each page are analyzed to determine how it should be indexed (titles, headings, or meta tags). Data about web pages are stored in an index database for use in later queries.

Searching
• When a user enters a query into a search engine (typically by using keywords), the engine checks its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text.
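The indexing and searching steps can be sketched with an inverted index. This is a minimal illustration under stated assumptions: the pages and URLs are hypothetical, crawling is skipped, and matching is plain Boolean AND without any ranking.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return the URLs containing every query term (Boolean AND), sorted for stable output."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

# Hypothetical crawled pages: URL -> extracted text.
pages = {
    "a.example/cat": "cat care and cat food",
    "b.example/mp3": "mp3 player reviews",
    "c.example/music": "mp3 music file downloads",
}
index = build_index(pages)
print(search(index, "mp3"))
print(search(index, "mp3 player"))
```

The query "mp3" alone matches both the player page and the music page, which is the ambiguity described in the Motivation slide.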


Google Search Engine

Google stores and indexes data with a shared-nothing architecture (a distributed computing architecture in which each node is independent and self-sufficient), called the Google File System.

After Google announced its IPO S-1 form in April 2004, Tristan Louis (the founder of Internet.com) estimated that Google’s servers included:
• 63,272 computers
• 126,544 processors
• 253,088 GHz workload
• 126,544 GB memory
• 5,062 TB storage


Search Engine Ranking

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of Web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others.

Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. Ranking algorithm.


Meta-search Engine

Search engines are measured by:
• Coverage
• Recency (freshness)

How to improve the coverage?
• Meta-search engine
• P2P platform
• Crawling hidden pages

Meta-search engine: you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages.
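Once the individual engines have answered, the meta-search engine has to merge their lists. The sketch below shows one simple way to do that, reciprocal-rank scoring. It is an assumption for illustration, not the method of any engine named in the slides, and the result lists are hypothetical (no real search API is called).

```python
from collections import defaultdict

def merge_results(result_lists):
    """Merge ranked URL lists; a URL scores 1/rank in every list that returns it."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists from three engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "x.example"]
engine_c = ["y.example", "w.example"]
print(merge_results([engine_a, engine_b, engine_c]))
```

Pages returned by several engines accumulate score, so the merged list covers more pages than any single engine while still favoring pages that several engines rank highly.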


Meta-search Engine (Cont.)

[Figure: a meta-search engine system. The user interface passes the query to Google, Yahoo!, MSN, AltaVista, Overture, and other search engines; a parser processes the returned results.]


Clustering Search Engine

Search engines (even meta-search engines) can return huge ranked lists of Web pages. However, this method is highly inefficient:
• Search results can be in the millions for a typical query.
• The criteria used for the ranking may not reflect the needs of the user.
• A majority of the queries tend to be short, thus making them non-specific or imprecise.

By clustering the search results, users can find the ones they really want efficiently and correctly.


Clustering Search Engine (Cont.)

The rise of clustering search engines:
• Usually built on a meta-search engine.
• Clustering search results provides a better way to help users find information quickly.
• Famous clustering search engines: Vivisimo, SnakeT, iBoogie, KartOO, Grokker, etc.


Results of Traditional Search Engines

[Figure: a flat result list in which items from Taxonomy 1, Taxonomy 2, and Taxonomy 3 are interleaved, mixing relevant and irrelevant information.]


Results of Clustering Search Engines

[Figure: results grouped into Taxonomy 1, Taxonomy 2, and Taxonomy 3, with Summary A, Summary B, and Summary C, separating relevant from irrelevant information.]


Vivisimo


iBoogie


KartOO


Clustering Search Engine System : CSES

We proposed a novel Clustering Search Engine System, called CSES, which combines:
• The information coverage provided by search engines
• The relevance of information offered by directory search systems

We proposed a simple but novel algorithm for clustering the web pages. This algorithm is fundamentally different from traditional clustering algorithms that require a tremendous amount of computation time.


CSES: Framework

[Figure: the Taxonomy Information System. Labels: query, clustering, result; Meta-search Engine (Google, Yahoo!, MSN) returning web sites; Meta-Directory System (Yahoo! Dir, Google Dir, ODP Dir) providing the directory tree; Data Grid; Grid Computing.]


CSES: Meta-Directory Tree Based Clustering

[Figure: web sites are compared with the directory tree (ODP, Yahoo! and Google) in two similarity comparisons, MA1 and MA2. Labels: Taxonomy: Tax1; Sub tax3; Ex: Music, Band.]

The similarity computation of MA1 and MA2 is based on term frequency.
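A term-frequency comparison against directory categories can be sketched as follows. The slides do not specify how MA1 and MA2 are computed, so the overlap measure, the category names, and the descriptive terms below are all assumptions made for illustration.

```python
from collections import Counter

def term_overlap(a, b):
    """Similarity as the number of shared term occurrences (sum of minimum term frequencies)."""
    return sum((Counter(a.lower().split()) & Counter(b.lower().split())).values())

def assign(snippet, categories):
    """Pick the directory category whose description shares the most terms with the snippet."""
    return max(categories, key=lambda name: term_overlap(snippet, categories[name]))

# Hypothetical directory categories with descriptive terms.
categories = {
    "Music/Band": "music band album concert song",
    "Animals/Cat": "cat kitten pet animal",
    "Business/Machinery": "caterpillar machinery company equipment",
}
results = ["cat food for your kitten", "caterpillar inc heavy equipment company"]
clusters = {r: assign(r, categories) for r in results}
print(clusters)
```

Because each result is compared only with the directory categories, and not with every other result, the cost grows with the number of categories rather than with the square of the number of results.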


CSES: User Interface

[Figure: the CSES user interface, showing the input area, the tree structure of clusters, and the search results.]


Future Work of CSES

The problems of a clustering search engine:
• Computation load (response time)
• Accuracy (relevance)
• Information display (user interface)

Grid computing, distributed computing

Social network

Apply this framework to other areas
