pattern reduction and information retrieval

Upload: sierra

Post on 21-Jan-2016

TRANSCRIPT

Page 1: Pattern Reduction and  Information Retrieval

Pattern Reduction and Information Retrieval

National Cheng Kung University, Department of Electrical Engineering (國立成功大學 電機工程學系)

楊竹星 (Chu-Sing Yang)

Page 2: Pattern Reduction and  Information Retrieval

http://itlab.ee.ncku.edu.tw/

Outline

• Pattern Reduction
• Information Retrieval
• Conclusion

Page 3: Pattern Reduction and  Information Retrieval

Combinatorial Optimization Problem

Complex Problems

NP-complete problem (Time)
• No optimum solution can be found in a reasonable time with limited computing resources.
• E.g., Traveling Salesman Problem

Large-scale problem (Space)
• In general, this kind of problem cannot be handled efficiently with limited memory space.
• E.g., Data Clustering Problem, astronomy, MRI

Page 4: Pattern Reduction and  Information Retrieval

Combinatorial Optimization Problem and Metaheuristics

Traveling Salesman Problem (n!)

[Figure: shortest routing path, comparing Path 1 and Path 2]
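To make the n! growth concrete, here is a minimal Python sketch that checks every tour of a tiny instance. The city coordinates and the names `tour_length` and `brute_force_tsp` are illustrative assumptions, not part of the slides.

```python
import itertools
import math

def tour_length(tour, coords):
    """Length of the closed tour that visits the cities in the given order."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def brute_force_tsp(coords):
    """Try every ordering of the cities, with city 0 fixed as the start."""
    best_tour, best_len = None, math.inf
    # Fixing the start leaves (n - 1)! orderings, still factorial in n.
    for perm in itertools.permutations(range(1, len(coords))):
        tour = (0,) + perm
        length = tour_length(tour, coords)
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

# Hypothetical instance: the four corners of a unit square.
coords = [(0, 0), (0, 1), (1, 1), (1, 0)]
best_tour, best_len = brute_force_tsp(coords)
```

With only four cities there are six orderings to check; at 15 cities there are already more than 87 billion, which is why exhaustive search is abandoned in favor of metaheuristics.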

Page 5: Pattern Reduction and  Information Retrieval

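An Example: Bulls and Cows

Check all candidate solutions

Secret number: 9305

• Opponent's try: 1234, feedback: 0A1B
• Opponent's try: 5678, feedback: 0A1B
• Deduction: only two of the digits 1 to 8 appear in the secret number, so the digits 0 and 9 must both be in the secret number.

(Example from Wikipedia)

The three steps correspond to the operators of a metaheuristic:
• Guess: Transition
• Feedback: Evaluation
• Deduction: Determination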
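The "check all candidate solutions" step can be sketched in Python as follows. This assumes the usual rule that the four digits of the secret are distinct; the function names are hypothetical.

```python
import itertools

def feedback(secret, guess):
    """Return (A, B): A = right digit in the right place,
    B = right digit in the wrong place. Assumes distinct digits."""
    a = sum(s == g for s, g in zip(secret, guess))
    b = len(set(secret) & set(guess)) - a
    return a, b

def consistent_candidates(history):
    """Keep every 4-digit candidate that agrees with all feedback so far."""
    candidates = ("".join(p) for p in itertools.permutations("0123456789", 4))
    return [
        c for c in candidates
        if all(feedback(c, guess) == fb for guess, fb in history)
    ]

# The two tries from the slide, both answered with 0A1B.
history = [("1234", (0, 1)), ("5678", (0, 1))]
remaining = consistent_candidates(history)
```

Every candidate left in `remaining` contains both 0 and 9, which is the deduction stated on the slide.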

Page 6: Pattern Reduction and  Information Retrieval

Concept (1/4)

Our observations show that many of the computations performed by most, if not all, metaheuristic algorithms during their convergence process are redundant.

(Data courtesy of Su and Chang)

Page 7: Pattern Reduction and  Information Retrieval

Concept (2/4)

Page 8: Pattern Reduction and  Information Retrieval

Concept (3/4)

Page 9: Pattern Reduction and  Information Retrieval

Concept (4/4)

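[Figure: two panels comparing solution length s over generations g for two solutions C1 and C2.

Metaheuristics: s = 4 in every generation.
• g = 1: C1 = 0 0 1 0, C2 = 1 1 1 0
• g = 2: C1 = 1 0 1 0, C2 = 0 1 1 0
• g = n: C1 = 0 1 1 0, C2 = 1 1 1 0

Metaheuristics + Pattern Reduction: the genes shared by C1 and C2 (the trailing 1 0) are removed after the first generation.
• g = 1, s = 4: C1 = 0 0 1 0, C2 = 1 1 1 0
• g = 2, s = 2: C1 = 1 0, C2 = 0 1
• g = n, s = 2: C1 = 0 1, C2 = 1 1]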

Page 10: Pattern Reduction and  Information Retrieval

The Proposed Algorithm

Create the initial solutions P = {p1, p2, . . . , pn}

While termination criterion is not met

    Apply the transition, evaluation, and determination operators of the metaheuristics in question to P

    /* Begin PR */

    Detect the sub-solutions R = {r1, r2, . . . , rm} that have a high probability not to be changed

    Compress the sub-solutions in R into a single pattern, say, c

    Remove the sub-solutions in R from P; that is, P = P \ R

    P = P ∪ {c}

    /* End PR */

End
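A toy Python sketch of this loop, under stated assumptions: the problem is OneMax on bit strings, the "metaheuristic" is a one-bit-flip hill climber, and a locus is treated as unlikely to change as soon as every solution agrees on it. None of these choices comes from the slides, and the compressed pattern c is kept in a side dictionary `fixed` rather than put back into P.

```python
import random

def reduce_patterns(population, active, fixed):
    """PR step: move every locus on which all solutions agree into `fixed`."""
    keep = []
    for col, locus in enumerate(active):
        if len({ind[col] for ind in population}) == 1:
            fixed[locus] = population[0][col]  # shared gene joins pattern c
        else:
            keep.append(col)
    reduced = [[ind[col] for col in keep] for ind in population]
    return reduced, [active[col] for col in keep]

def one_max_with_pr(n_bits=12, pop_size=4, iterations=20, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    active, fixed = list(range(n_bits)), {}
    for _ in range(iterations):
        if not active:  # everything has been compressed
            break
        for i, ind in enumerate(population):
            child = ind[:]
            child[rng.randrange(len(active))] ^= 1  # transition: flip one bit
            if sum(child) >= sum(ind):              # evaluation (OneMax)
                population[i] = child               # determination
        population, active = reduce_patterns(population, active, fixed)
    return population, active, fixed

population, active, fixed = one_max_with_pr()
```

Each pass works only on the loci still listed in `active`, so the per-iteration cost shrinks as patterns are compressed. The price is that a locus frozen at a poor value stays frozen, which is why the detection rule matters.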

Page 11: Pattern Reduction and  Information Retrieval

Detection

Time-Oriented
• Detect patterns that have not changed in a certain number of iterations, a.k.a. static patterns.

Space-Oriented
• Detect sub-solutions that are common at certain loci.

Problem-Specific
• E.g., for k-means, we assume that patterns near a centroid are unlikely to be reassigned to another cluster.

[Figure, two solutions: P1: 1352476 and P2: 7352614 at T1 become P1: 1 C1 476 and P2: 7 C1 614 at Tn.]

[Figure, one solution over time: T1: 1352476, T2: 7352614, T3: 7352416, ..., Tn: 7 C1 416.]
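Both figure examples reduce to the same check: find the loci where a set of strings agrees, then replace that run with one token. The Python sketch below illustrates this on the strings from the slide; `static_loci` and `compress` are hypothetical helper names, and `compress` assumes the detected loci form one contiguous run.

```python
def static_loci(snapshots):
    """Loci whose value is identical in every snapshot.

    Pass one solution at several iterations for time-oriented detection,
    or several solutions at one iteration for space-oriented detection.
    """
    return [i for i, genes in enumerate(zip(*snapshots)) if len(set(genes)) == 1]

def compress(solution, loci, token="C1"):
    """Replace a contiguous run of loci with a single pattern token."""
    if not loci:
        return solution
    start, end = loci[0], loci[-1]
    if loci != list(range(start, end + 1)):
        raise ValueError("this sketch only handles one contiguous run")
    parts = (solution[:start], token, solution[end + 1:])
    return " ".join(part for part in parts if part)

over_time = ["1352476", "7352614", "7352416"]   # T1, T2, T3
loci = static_loci(over_time)                    # positions of "352"
compressed = compress(over_time[-1], loci)       # "7 C1 416"
```

Calling `static_loci(["1352476", "7352614"])` on the two solutions P1 and P2 finds the same run, and `compress` then yields `1 C1 476` and `7 C1 614`.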

Page 12: Pattern Reduction and  Information Retrieval

An Example

Page 13: Pattern Reduction and  Information Retrieval

Time Complexity

Ideally, the running time of “k-means with PR” is independent of the number of iterations.

In reality, however, our experimental results show that setting the removal bound to 80% gives the best result.

Here n is the number of patterns, k the number of clusters, l the number of iterations, and d the number of dimensions.
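A back-of-the-envelope argument, not taken from the slides, suggests why the ideal running time loses its dependence on l. Standard k-means computes n·k distances in d dimensions per iteration. Assume, hypothetically, that PR removes a fixed fraction ρ (0 < ρ ≤ 1) of the remaining patterns in every iteration and that the few compressed patterns added back are negligible:

```latex
T_{k\text{-means}} = O(n k l d),
\qquad
T_{\text{PR}} = \sum_{i=0}^{l-1} (1-\rho)^{i}\, n k d
\;\le\; \frac{n k d}{\rho} = O(n k d).
```

The geometric series is bounded regardless of how many iterations are run, so under this assumption the total cost behaves like a constant number of full k-means iterations.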

Page 14: Pattern Reduction and  Information Retrieval

Outline

• Pattern Reduction
• Information Retrieval
• Conclusion

Page 15: Pattern Reduction and  Information Retrieval

Introduction

Over the past decade, computers have transformed traditional printed material into digital material.

Internet technology makes most information and knowledge searchable and usable by anyone.

Acquiring knowledge is no longer limited by geography, as a search engine can be shared and used by anyone, anywhere, anytime, using any internet browsing software.

[Figure: Printed Material, Digital Material, Database, Web Pages]

Page 16: Pattern Reduction and  Information Retrieval

The history

[Figure: timeline marked 1960, 1980, and 1990, moving from Printed Material to Digital Material to Web Pages. Other labels: Internet, Data Analysis and Mining, Web Information Extraction, Web Information Retrieval, Web Mining, Application.]

Page 17: Pattern Reduction and  Information Retrieval

Problem

Due to the growth of online information, a large number of files have flooded the Internet.

We can easily get the information we need, but we spend too much time seeking out the relevant information.

[Figure: the Internet and a library]

Page 18: Pattern Reduction and  Information Retrieval

Problem (cont.)

Users often cannot cope with the sheer volume of information on the Internet.

Page 19: Pattern Reduction and  Information Retrieval

Web IR

Information Retrieval (IR)

• The goal of IR is to retrieve documents whose content is relevant to the user's need and to find the relationships between the documents.
• Rieh and Xie pointed out that information retrieval is an interactive and iterative process.
• A collection of documents is a set of documents related to a specific context of interest.
• Research on information retrieval covers a very broad area, including the dependence analysis of a group of files, the clustering of files, and the classification of files.

Page 20: Pattern Reduction and  Information Retrieval

IR and Web IR

Traditional IR to Web IR

New information sources:
• Digital Data → Web Page, Database, Internet, etc.

New media types:
• Text → HTML, Image, Video, Audio.

New applications:
• File or Data → Web Search, Video Search, Audio Search.

The major difference between classic information retrieval (CIR) and web information retrieval (Web IR) is that they face different data sets.

Page 21: Pattern Reduction and  Information Retrieval

A taxonomy of information retrieval models

User Task
• Retrieval: Ad hoc, Filtering
• Browsing

Models for retrieval
• Classic Models: Boolean, Vector, Probabilistic
  • Set Theoretic: Fuzzy, Extended Boolean
  • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  • Probabilistic: Inference Network, Belief Network
• Structured Models: Non-Overlapping Lists, Proximal Nodes

Models for browsing
• Flat
• Structure Guided
• Hypertext

(Source: R. Baeza-Yates)

Page 22: Pattern Reduction and  Information Retrieval

Document Similarity

Vector Space Model (VSM)
• G. Salton and M.E. Lesk, 1968
• The cosine of θ, the angle between the two vectors, is the similarity between document j and query q.
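As an illustration of the cosine measure, the Python sketch below builds raw term-frequency vectors for a document and a query and returns cos θ. Using raw counts as weights is an assumption made for brevity; the slide does not say which term weighting is used.

```python
import math
from collections import Counter

def cosine_similarity(document, query):
    """Cosine of the angle between two term-frequency vectors."""
    d = Counter(document.lower().split())
    q = Counter(query.lower().split())
    dot = sum(d[term] * q[term] for term in d.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in d.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

score = cosine_similarity("mp3 music file download", "mp3 player")
```

The result lies between 0 (no shared terms) and 1 (identical direction), independent of document length.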

Page 23: Pattern Reduction and  Information Retrieval

Web Information Extraction

Information Extraction (IE)

Wrapper (human)
• Given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.

IE (automation)
• Given a set of positive pages, generate extraction patterns.
• Given only a single page with multiple data records, generate extraction patterns.

Assumption: the data have a structure or a schema.

Integrate the data present in different web sites.
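To show what an extraction rule looks like once it exists, here is a small Python sketch that applies one hand-written pattern to a page of repeated records. The HTML, the field names, and the rule itself are all made up for illustration; a wrapper-induction system would learn such a rule from labeled pages instead.

```python
import re

# Hypothetical page: several data records sharing one structure.
PAGE = """
<li><b>Widget</b> <span class="price">$4.50</span></li>
<li><b>Gadget</b> <span class="price">$12.00</span></li>
"""

# Hand-written extraction rule standing in for a learned one.
RECORD_RULE = re.compile(
    r'<li><b>(?P<name>[^<]+)</b>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span></li>'
)

def extract_records(html):
    """Apply the rule to every record on the page."""
    return [
        {"name": m["name"], "price": float(m["price"])}
        for m in RECORD_RULE.finditer(html)
    ]

records = extract_records(PAGE)
```

The rule only works because the records share a schema, which is exactly the assumption stated on the slide.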

Page 24: Pattern Reduction and  Information Retrieval

An example of IE

[Figure: a web page divided into five information blocks, each labeled IB]

IB: Information Block

Page 25: Pattern Reduction and  Information Retrieval

Web Mining

Data mining mines knowledge from data, whereas web mining mines information from the World Wide Web.

Web mining is broadly defined as the discovery and analysis of useful information from the Web.

Web mining can be divided into:
• Web usage mining
• Web content mining
• Web structure mining

Page 26: Pattern Reduction and  Information Retrieval

Web Mining Process

1. Web page
2. Wrapper (extract rules)
3. Patterns
4. Mining
5. Information or knowledge

Page 27: Pattern Reduction and  Information Retrieval

Web Structure Mining

• Generates a structural summary of the web site and its web pages.
• Tries to discover the link structure of the hyperlinks in the web pages.
• Reveals more information than the content of the web pages alone.

Page 28: Pattern Reduction and  Information Retrieval

Web structure mining – PageRank (Google)

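[Figure: PageRank example (S. Brin) with link weights of 30 and 3. Incoming weights are summed per page (30+3 = 33, 30+3+3 = 36), and each total is divided by the number of links (33/4 = 8.25, 36/2 = 18).]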
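The idea that a page's score is shared among the pages it links to can be sketched as a simplified PageRank power iteration in Python. The damping factor 0.85, the iteration count, and the three-page graph are assumptions for illustration, and every link target must also appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among its outgoing links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Here C ends up ranked above B because it receives rank from both A and B, while B only receives half of A's share.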

Page 29: Pattern Reduction and  Information Retrieval

CSES – A Cluster Search Engine

• Motivation
• Search Engine Operations
• Other Types of Search Engines
  • Meta-search Engine
  • Clustering Search Engine
• CSES: A Clustering Search Engine
  • System Framework
  • Clustering Algorithm
  • User Interface

Page 30: Pattern Reduction and  Information Retrieval

Motivation

A search engine is an information retrieval system designed to help users find information on a computer.

Problem: queries are ambiguous. For example, if “mp3” is given to a search engine, it could mean an “mp3 music file” or an “mp3 player.” Another example: when the keyword “cat” (meaning the animal) is given as a query to the Google search engine, the first item returned is the company “Caterpillar, Inc.,” which has nothing to do with the animal.

Page 31: Pattern Reduction and  Information Retrieval

Search Engine Operations

Web crawling (Web spider, Web robot)
• An automated Web browser that follows every link it sees.

Indexing (index file)
• The contents of each page are analyzed to determine how it should be indexed (e.g., by titles, headings, or meta tags). Data about web pages are stored in an index database for use in later queries.

Searching
• When a user enters a query into a search engine (typically by using keywords), the engine checks its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text.
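The indexing and searching steps can be illustrated with a minimal inverted index in Python. The page URLs and texts are hypothetical, crawling is skipped (the pages are given as a dictionary), and ranking is reduced to an alphabetical sort.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexing: map each term to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Searching: return the pages that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)

# Hypothetical crawled pages.
pages = {
    "a.example": "MP3 player reviews",
    "b.example": "Free mp3 music file",
    "c.example": "Cat pictures",
}
index = build_index(pages)
results = search(index, "mp3 player")
```

Because the index is built once and reused, a query only touches the entries for its own terms instead of scanning every page.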

Page 32: Pattern Reduction and  Information Retrieval

Google Search Engine

Google stores and indexes data on a shared-nothing architecture (a distributed computing architecture in which each node is independent and self-sufficient), using the Google File System.

After Google filed its IPO Form S-1 in April 2004, Tristan Louis (the founder of Internet.com) estimated that Google's server farm included:
• 63,272 computers
• 126,544 processors
• 253,088 GHz of processing power
• 126,544 GB of memory
• 5,062 TB of storage

Page 33: Pattern Reduction and  Information Retrieval

Search Engine Ranking

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of Web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others.

Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. Ranking algorithm.

Page 34: Pattern Reduction and  Information Retrieval

Meta-search Engine

Search engines are measured by:
• Coverage
• Recency (freshness)

How to improve coverage?
• Meta-search engine
• P2P platform
• Crawling hidden pages

Meta-search engine: You submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages.
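A minimal Python sketch of that dispatch-and-merge step follows. The two engines are stub functions with made-up results standing in for real back ends, and the merge rule (summing 1/rank across engines) is one simple choice among many, not something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub engines standing in for real search back ends (hypothetical data).
def engine_a(query):
    return ["x.example", "y.example", "z.example"]

def engine_b(query):
    return ["y.example", "w.example"]

def meta_search(query, engines):
    """Send the query to every engine at once and merge the ranked lists."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            # Pages ranked highly by several engines accumulate more score.
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=lambda url: (-scores[url], url))

merged = meta_search("mp3", [engine_a, engine_b])
```

The merged list covers the union of both engines' results, which is how a meta-search engine improves coverage; duplicates such as `y.example` are collapsed and pushed upward.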

Page 35: Pattern Reduction and  Information Retrieval

Meta-search Engine (Cont.)

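[Figure: the user interface passes the query to the meta-search engine system, which forwards it to Google, Yahoo!, MSN, AltaVista, Overture, and other search engines. A parser processes what they return, and the result goes back to the user interface.]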

Page 36: Pattern Reduction and  Information Retrieval

Clustering Search Engine

A search engine (even a meta-search engine) can return a huge ranked list of Web pages. However, this method is highly inefficient:
• Search results can be in the millions for a typical query.
• The criteria used for the ranking may not reflect the needs of the user.
• A majority of the queries tend to be short, which makes them non-specific or imprecise.

By clustering the search results, users can find the ones they really want efficiently and correctly.

Page 37: Pattern Reduction and  Information Retrieval

Clustering Search Engine (Cont.)

The rise of clustering search engines:
• Usually built on a meta-search engine.
• Clustered search results provide a better way to help users find information quickly.
• Famous clustering search engines: Vivisimo, SnakeT, iBoogie, KartOO, Grokker, etc.

Page 38: Pattern Reduction and  Information Retrieval

Results of Traditional Search Engines

[Figure: a flat result list in which items from Taxonomy 1, Taxonomy 2, and Taxonomy 3 are interleaved, mixing relevant and irrelevant information.]

Page 39: Pattern Reduction and  Information Retrieval

Results of Clustering Search Engines

[Figure: the same results grouped by Taxonomy 1, Taxonomy 2, and Taxonomy 3, with Summary A, Summary B, and Summary C, so that relevant and irrelevant information are separated.]

Page 40: Pattern Reduction and  Information Retrieval

Vivisimo

Page 41: Pattern Reduction and  Information Retrieval

iBoogie

Page 42: Pattern Reduction and  Information Retrieval

KartOO

Page 43: Pattern Reduction and  Information Retrieval

Clustering Search Engine System : CSES

We proposed a novel Clustering Search Engine System, called CSES, which combines:
• the information coverage provided by search engines
• the relevance of information offered by directory search systems

We proposed a simple but novel algorithm for clustering the web pages. This algorithm is fundamentally different from traditional clustering algorithms that require a tremendous amount of computation time.

Page 44: Pattern Reduction and  Information Retrieval

CSES: Framework

[Figure: CSES framework. Components: Taxonomy Information System; Meta-search Engine (Google, Yahoo!, MSN); Meta-Directory System (Yahoo! Dir, Google Dir, ODP Dir); Web sites; Directory Tree; Data Grid; Grid Computing. Flow labels: query, clustering, result.]

Page 45: Pattern Reduction and  Information Retrieval

CSES: Meta-Directory Tree Based Clustering

[Figure: web sites are matched against the directory tree (ODP, Yahoo!, and Google) in two similarity comparisons, MA1 and MA2, which yield a taxonomy and a sub-taxonomy (Taxonomy: Tax1; Sub-taxonomy: tax3; e.g., Music, Band).]

The similarity computation of MA1 and MA2 is based on term frequency.
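One way to read the two comparisons is as a two-stage match: first pick the taxonomy, then pick the sub-taxonomy inside it. The Python sketch below follows that reading with a plain shared-term count as the similarity. The directory contents, the function names, and the two-stage interpretation itself are assumptions; the slides do not give the actual MA1 and MA2 formulas.

```python
from collections import Counter

def tf_similarity(text_a, text_b):
    """Number of term occurrences the two texts share (raw term frequency)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    return sum((a & b).values())

# Hypothetical directory tree: taxonomy -> sub-taxonomy -> descriptive terms.
DIRECTORY = {
    "Music": {"Band": "rock band concert tour album",
              "Instrument": "guitar piano violin lessons"},
    "Animals": {"Cat": "cat kitten pet feline",
                "Dog": "dog puppy pet canine"},
}

def classify(snippet):
    """Assign a search-result snippet to a (taxonomy, sub-taxonomy) pair."""
    # First comparison: snippet against all terms under each taxonomy.
    taxonomy = max(
        DIRECTORY,
        key=lambda t: tf_similarity(snippet, " ".join(DIRECTORY[t].values())),
    )
    # Second comparison: snippet against each sub-taxonomy of the winner.
    subs = DIRECTORY[taxonomy]
    sub = max(subs, key=lambda s: tf_similarity(snippet, subs[s]))
    return taxonomy, sub

label = classify("new album and tour dates for the band")
```

Because each result is matched against a fixed directory tree rather than against every other result, the cost grows with the number of results times the number of categories, avoiding the pairwise comparisons of traditional clustering.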

Page 46: Pattern Reduction and  Information Retrieval

CSES: User Interface

[Figure: the CSES user interface, showing the input area, the tree structure of clusters, and the search results.]

Page 47: Pattern Reduction and  Information Retrieval

Future Work of CSES

Open problems of a cluster search engine:
• Computation load (response time)
• Accuracy (relevance)
• Information display (user interface)

Grid computing, distributed computing

Social network

Apply this framework to other areas