Pattern Reduction and Information Retrieval
Department of Electrical Engineering, National Cheng Kung University (國立成功大學 電機工程學系)
Chu-Sing Yang (楊竹星)
http://itlab.ee.ncku.edu.tw/
Outline
• Pattern Reduction
• Information Retrieval
• Conclusion
Combinatorial Optimization Problem
Complex problems
NP-complete problem (time)
• No known algorithm finds an optimal solution in a reasonable time with limited computing resources.
• E.g., the Traveling Salesman Problem
Large-scale problem (space)
• In general, this kind of problem cannot be handled efficiently with limited memory space.
• E.g., the data clustering problem, astronomy, MRI
Combinatorial Optimization Problem and Metaheuristics
Traveling Salesman Problem (n! candidate tours): find the shortest routing path
[Figure: two candidate routes, Path 1 and Path 2]
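To make the n! growth concrete, here is a minimal brute-force sketch, not taken from the slides. The function names and the coordinate-tuple representation of cities are assumptions made for illustration. Fixing the start city still leaves (n-1)! tours to check, which is why exhaustive search is only feasible for very small n.

```python
from itertools import permutations
from math import dist

def tour_length(tour, cities):
    """Length of the closed tour that visits `cities` in the order given by `tour`."""
    return sum(
        dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def brute_force_tsp(cities):
    """Try every tour that starts at city 0 and return the shortest one."""
    rest = range(1, len(cities))
    tours = ((0,) + p for p in permutations(rest))  # (n-1)! candidate tours
    best = min(tours, key=lambda t: tour_length(t, cities))
    return best, tour_length(best, cities)

if __name__ == "__main__":
    # Hypothetical instance: four cities on the corners of a unit square.
    print(brute_force_tsp([(0, 0), (0, 1), (1, 1), (1, 0)]))
```

Metaheuristics avoid this enumeration by sampling and improving a small set of candidate tours instead.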
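An Example: Bulls and Cows
Check all candidate solutions: guess, feedback, deduction.
Secret number: 9305
• Opponent's try: 1234, feedback: 0A1B
• Opponent's try: 5678, feedback: 0A1B
Deduction: each try shares exactly one digit with the four-digit secret number, so the digits 0 and 9 must both be in the secret number.
(Example from Wikipedia)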
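The deduction on this slide can be checked by enumerating all candidates, as in the sketch below. It assumes the common rule that the secret and every guess use distinct digits; the function names are illustrative and do not come from the slides.

```python
from itertools import permutations

def feedback(secret, guess):
    """Return (A, B): A digits in the right place, B digits present but misplaced.
    Assumes all digits within `secret` and within `guess` are distinct."""
    bulls = sum(s == g for s, g in zip(secret, guess))
    cows = sum(g in secret for g in guess) - bulls
    return bulls, cows

def consistent_candidates(history, length=4):
    """Check all candidate solutions against the feedback seen so far."""
    candidates = ("".join(p) for p in permutations("0123456789", length))
    return [c for c in candidates
            if all(feedback(c, guess) == fb for guess, fb in history)]

if __name__ == "__main__":
    history = [("1234", (0, 1)), ("5678", (0, 1))]
    remaining = consistent_candidates(history)
    # Every remaining candidate contains both 0 and 9.
    print(all("0" in c and "9" in c for c in remaining))
```

Each round of feedback shrinks the candidate set, which is the transition, evaluation, and determination loop in miniature.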
[Figure: the search loop repeats three operators: Transition, Evaluation, Determination]
Concept (1/4)
Our observation shows that a large share of the computations performed by most, if not all, metaheuristic algorithms during their convergence process is redundant.
(Data courtesy of Su and Chang)
Concept (2/4)
Concept (3/4)
Concept (4/4)
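[Figure: Metaheuristics vs. Metaheuristics + Pattern Reduction, with g the generation and s the solution length. Both start at g = 1, s = 4 with C1 = 0 0 1 0 and C2 = 1 1 1 0. Without pattern reduction, the solutions keep s = 4 for g = 2, ..., n (e.g., C1 = 1 0 1 0, C2 = 0 1 1 0, and later C1 = 0 1 1 0, C2 = 1 1 1 0). With pattern reduction, the trailing sub-solution 1 0 shared by C1 and C2 is compressed, so only s = 2 remains for g = 2, ..., n (e.g., C1 = 1 0, C2 = 0 1, and later C1 = 0 1, C2 = 1 1).]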
The Proposed Algorithm
Create the initial solutions P = {p1, p2, . . . , pn}
While the termination criterion is not met
    Apply the transition, evaluation, and determination operators of the metaheuristics in question to P
    /* Begin PR */
    Detect the sub-solutions R = {r1, r2, . . . , rm} that have a high probability not to be changed
    Compress the sub-solutions in R into a single pattern, say, c
    Remove the sub-solutions in R from P; that is, P = P \ R
    P = P ∪ {c}
    /* End PR */
End
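As one possible instance of this loop, the sketch below applies the detect, compress, and remove steps to k-means. It is an illustration under stated assumptions, not the authors' implementation: the first k points seed the centroids, a pattern counts as static once its cluster label has stayed unchanged for `static_after` iterations, and each cluster's compressed pattern c is kept as a running sum and count. All names are hypothetical.

```python
def kmeans_pr(points, k, max_iters=50, static_after=2):
    """k-means with a simple pattern-reduction step (illustrative sketch)."""
    dim = len(points[0])
    centroids = [list(p) for p in points[:k]]      # assumption: first k points as seeds
    labels = [-1] * len(points)
    stable = [0] * len(points)                     # iterations each label stayed unchanged
    active = set(range(len(points)))               # patterns still in P
    frozen_sum = [[0.0] * dim for _ in range(k)]   # compressed pattern c, one per cluster
    frozen_cnt = [0] * k
    distance_evals = 0

    for _ in range(max_iters):
        if not active:
            break
        # Transition, evaluation, determination: reassign only the active patterns.
        for i in active:
            dists = [sum((a - b) ** 2 for a, b in zip(points[i], c)) for c in centroids]
            distance_evals += k
            best = dists.index(min(dists))
            stable[i] = stable[i] + 1 if best == labels[i] else 0
            labels[i] = best
        # Detect: patterns unlikely to change cluster again.
        static = [i for i in active if stable[i] >= static_after]
        # Compress and remove: fold them into their cluster's sum, drop them from P.
        for i in static:
            j = labels[i]
            frozen_cnt[j] += 1
            for d in range(dim):
                frozen_sum[j][d] += points[i][d]
            active.discard(i)
        # Update centroids from the compressed patterns plus the active patterns.
        for j in range(k):
            total, count = list(frozen_sum[j]), frozen_cnt[j]
            for i in active:
                if labels[i] == j:
                    count += 1
                    for d in range(dim):
                        total[d] += points[i][d]
            if count:
                centroids[j] = [t / count for t in total]
    return labels, centroids, distance_evals
```

Because frozen patterns are never revisited, the number of distance evaluations per iteration shrinks as the search converges; the price is that a pattern frozen too early cannot be corrected later.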
Detection
Time-Oriented
• Detect patterns that have not changed in a certain number of iterations, aka static patterns.
Space-Oriented
• Detect sub-solutions that are common at certain loci.
Problem-Specific
• E.g., for k-means, we assume that patterns near a centroid are unlikely to be reassigned to another cluster.
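Space-oriented example (the sub-solution 352 is common to both solutions and is compressed into C1):
T1: P1: 1352476, P2: 7352614, …
Tn: P1: 1 C1 476, P2: 7 C1 614
Time-oriented example (the sub-solution 352 stays unchanged over the iterations and is compressed into C1):
T1: 1352476
T2: 7352614
T3: 7352416
…
Tn: 7 C1 416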
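Both detection styles in the example reduce to finding loci whose value is identical across a set of solutions: across the population for space-oriented detection, or across successive iterations for time-oriented detection. The helper names below are hypothetical, and the sketch treats solutions as strings of single-character genes.

```python
def common_loci(solutions):
    """Indices at which every solution holds the same value."""
    return [i for i, column in enumerate(zip(*solutions)) if len(set(column)) == 1]

def compress(solution, loci, token="C1"):
    """Replace each contiguous run of detected loci with a single token."""
    locus_set = set(loci)
    out, in_run = [], False
    for i, gene in enumerate(solution):
        if i in locus_set:
            if not in_run:
                out.append(token)
            in_run = True
        else:
            out.append(gene)
            in_run = False
    return out

if __name__ == "__main__":
    population = ["1352476", "7352614"]            # space-oriented: P1 and P2
    loci = common_loci(population)
    print(loci, [compress(p, loci) for p in population])
    history = ["1352476", "7352614", "7352416"]    # time-oriented: T1, T2, T3
    print(compress(history[-1], common_loci(history)))
```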
An Example
Time Complexity
Ideally, the running time of “k-means with PR” is independent of the number of iterations.
In reality, however, our experimental result shows that setting the removal bound to 80% gives the best result.
Here, n is the number of patterns, k the number of clusters, l the number of iterations, and d the number of dimensions.
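The formula from this slide did not survive in the transcript. As a rough reconstruction of the reasoning, not the authors' exact expression: standard k-means costs O(nkld). If we assume, hypothetically, that only a constant fraction r (0 < r < 1) of the remaining patterns survives each iteration, the cost of k-means with PR is bounded by a geometric series that no longer depends on l.

```latex
T_{k\text{-means}} = O(n k l d),
\qquad
\sum_{i=0}^{l-1} n\, r^{i}\, k\, d
  \;=\; n k d \,\frac{1 - r^{l}}{1 - r}
  \;<\; \frac{n k d}{1 - r}
\quad\Longrightarrow\quad
T_{k\text{-means with PR}} = O(n k d).
```

The symbol r and the constant-fraction assumption are introduced here only for illustration; in practice the fraction removed varies from iteration to iteration.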
Outline
• Pattern Reduction
• Information Retrieval
• Conclusion
Introduction
Over the past decade, computers have transformed traditional printed material into digital material.
Internet technology allows most information and knowledge to be searched and used by anyone.
Acquiring knowledge is no longer limited by geography, as a search engine can be shared and used by anyone, anywhere, anytime, using any internet browsing software.
[Figure: Printed Material, Digital Material, Database, Web Pages]
The history
[Figure: timeline marked 1960, 1980, and 1990, moving from Printed Material to Digital Material to Web Pages. From 1990: Internet, Data Analysis and Mining (Web Information Extraction, Web Information Retrieval, Web Mining), Application.]
Problem
Due to the growth of online information, a large number of files have flooded the Internet.
We can easily get the information we need, but we spend too much time seeking out the relevant information.
[Figure: Internet and Library]
Problem (cont.)
Users often cannot handle the large amount of information on the Internet.
Web IR
Information Retrieval (IR)
• The goal of IR is to retrieve documents whose content is relevant to the user's need and to find the relationships between the documents.
Rieh and Xie pointed out that information retrieval is an interactive and iterative process.
A collection of documents is a set of documents related to a specific context of interest.
Research on information retrieval covers a very broad area, including the dependence analysis of a group of files, the clustering of files, and the classification of files.
IR and Web IR
From traditional IR to Web IR
New information sources:
• Digital data → web pages, databases, the Internet, etc.
New media types:
• Text → HTML, image, video, audio
New applications:
• File or data search → web search, video search, audio search
However, the major difference between classic information retrieval (CIR) and web information retrieval (Web IR) is that they face different data sets.
A taxonomy of information retrieval models (R. Baeza-Yates)
User task: Retrieval (ad hoc, filtering)
• Classic models: Boolean, Vector, Probabilistic
  • Set theoretic: Fuzzy, Extended Boolean
  • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  • Probabilistic: Inference Network, Belief Network
• Structured models: Non-Overlapping Lists, Proximal Nodes
User task: Browsing
• Browsing: Flat, Structure Guided, Hypertext
Document Similarity
Vector Space Model (VSM), G. Salton and M. E. Lesk, 1968
• The cosine of the angle θ between the vector of document j and the vector of query q is taken as their similarity.
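The cosine measure is the dot product of the two term vectors divided by the product of their lengths. The sketch below uses raw term counts as vector weights, which is an assumption; the slides do not say which weighting (for example tf-idf) is used, and the function name is illustrative.

```python
import math
from collections import Counter

def cosine_similarity(doc_terms, query_terms):
    """cos(theta) between two bags of words, weighted by raw term counts."""
    d, q = Counter(doc_terms), Counter(query_terms)
    dot = sum(d[t] * q[t] for t in d.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in d.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    doc = "mp3 player with mp3 music".split()
    query = "mp3 music".split()
    print(cosine_similarity(doc, query))
```

A value near 1 means the document and query point in nearly the same direction in term space; 0 means they share no terms.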
Web Information Extraction
Information Extraction (IE)
• Wrapper (human): given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.
• IE (automatic):
  • Given a set of positive pages, generate extraction patterns.
  • Given only a single page with multiple data records, generate extraction patterns.
Assumption: the data has a structure or a schema.
Integrate the data present in different web sites.
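To show what applying an extraction pattern looks like once it has been learned or generated, here is a small sketch. The HTML layout, the regular expression, and the field names are all hypothetical; real wrappers are induced from labeled or repeated page structure rather than written by hand like this.

```python
import re

# Hypothetical extraction rule for pages whose records look like:
#   <li class="item"><b>TITLE</b><span>PRICE</span></li>
RECORD_PATTERN = re.compile(
    r'<li class="item">\s*<b>(?P<title>[^<]+)</b>\s*'
    r'<span>(?P<price>[^<]+)</span>\s*</li>'
)

def extract_records(html):
    """Apply the extraction pattern and return one dict per data record."""
    return [m.groupdict() for m in RECORD_PATTERN.finditer(html)]

if __name__ == "__main__":
    page = ('<ul><li class="item"><b>Album A</b><span>$10</span></li>'
            '<li class="item"><b>Album B</b><span>$12</span></li></ul>')
    print(extract_records(page))
```

The pattern only works because the data records share a structure, which is exactly the assumption stated on the slide.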
An example of IE
[Figure: five regions labeled IB]
IB: Information Block
Web Mining
Data mining mines knowledge from data, whereas web mining mines information from the World Wide Web.
Web mining is broadly defined as the discovery and analysis of useful information from the Web.
Web mining can be divided into:
• Web usage mining
• Web content mining
• Web structure mining
Web Mining Process
1. Web page
2. Wrapper (extract rules)
3. Patterns
4. Mining
5. Information or knowledge
Web Structure Mining
• Generates a structural summary of the web site and its web pages.
• Tries to discover the link structure of the hyperlinks in the web pages.
• Reveals more information than just the information contained in the web pages.
Web structure mining – PageRank (Google)
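[Figure, after S. Brin: simplified rank propagation. Pages with scores 30 and 3 link to other pages; incoming scores are summed (30 + 3 = 33; 30 + 3 + 3 = 36) and each page splits its score evenly among its out-links (33/4 = 8.25; 36/2 = 18). A further value of 9 also appears in the figure.]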
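The figure's arithmetic, where a page's score is shared equally among its out-links, is the core step of PageRank. A minimal power-iteration sketch follows. The damping factor of 0.85 and the handling of pages without out-links are common conventions assumed here, not details from the slides, and the graph in the usage example is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over `links`: {page: [pages it links to]}.
    Every link target must also appear as a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)  # split evenly among out-links
                for target in outs:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

if __name__ == "__main__":
    # Hypothetical three-page web: A links to B and C, B links to C, C links to A.
    print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```

In this toy graph, C receives links from both A and B, so it ends up ranked above B.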
CSES – A Cluster Search Engine
• Motivation
• Search Engine Operations
• Other Types of Search Engines
  • Meta-search Engine
  • Clustering Search Engine
• CSES: A Clustering Search Engine System
  • Framework
  • Clustering Algorithm
  • User Interface
Motivation
A search engine is an information retrieval system designed to help users find information on a computer.
Problem: queries can be ambiguous. For example, if “mp3” is given to a search engine, it could mean an “mp3 music file” or an “mp3 player.” Another example: when the keyword “cat” (meaning the animal) is given as a query to the Google search engine, the first item returned is the company “Caterpillar, Inc.,” which has nothing to do with the animal.
Search Engine Operations
Web crawling (Web spider, Web robot)
• An automated Web browser that follows every link it sees.
Indexing
• The contents of each page are analyzed to determine how it should be indexed (e.g., by titles, headings, or meta tags). Data about web pages are stored in an index database for use in later queries.
Searching
• When a user enters a query into a search engine (typically by using keywords), the engine checks its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text.
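The indexing and searching steps can be sketched with an inverted index that maps each term to the pages containing it. This is a toy illustration: the URLs and page texts are made up, the function names are hypothetical, and a real engine would also store positions, weights, and summaries.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexing: map every term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Searching: return the URLs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

if __name__ == "__main__":
    pages = {  # hypothetical crawled pages
        "a.example/music": "free mp3 music file",
        "b.example/player": "mp3 player review",
        "c.example/cat": "cat food",
    }
    index = build_index(pages)
    print(search(index, "mp3"))         # both mp3 pages match the ambiguous query
    print(search(index, "mp3 player"))
```

The first query illustrates the ambiguity problem from the Motivation slide: a single keyword matches pages about two different things.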
Google Search Engine
Google stores and indexes data on a shared-nothing architecture (a distributed computing architecture in which each node is independent and self-sufficient), using the Google File System.
After Google filed its IPO Form S-1 in April 2004, Tristan Louis (the founder of Internet.com) estimated that Google's servers included:
• 63,272 computers
• 126,544 processors
• 253,088 GHz of processing power
• 126,544 GB of memory
• 5,062 TB of storage
Search Engine Ranking
The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of Web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others.
Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. This is the job of the ranking algorithm.
Meta-search Engine
Search engines are measured by:
• Coverage
• Recency (freshness)
How to improve coverage?
• Meta-search engine
• P2P platform
• Crawling hidden pages
Meta-search engine: you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages.
Meta-search Engine (Cont.)
[Figure: meta-search engine system. Components: User Interface (query in, result out), Parser, and the underlying engines Yahoo!, MSN, altavista, overture, and other search engines.]
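A meta-search engine has to merge the ranked lists returned by the individual engines. The sketch below uses a simple reciprocal-rank sum as the merging rule, which is an assumption; the slides do not say how results are combined. The engines are stand-in functions rather than real search APIs, and they are queried one after another here for simplicity, whereas a real system would query them simultaneously.

```python
def meta_search(query, engines):
    """Send `query` to every engine and merge the ranked URL lists.
    `engines` maps a name to a function: query -> list of URLs, best first."""
    scores = {}
    for name, engine in engines.items():
        for rank, url in enumerate(engine(query), start=1):
            # Assumed merging rule: a URL earns 1/rank from each engine that returns it.
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=lambda u: (-scores[u], u))

if __name__ == "__main__":
    # Hypothetical engines returning fixed result lists.
    engines = {
        "engine_1": lambda q: ["x.example", "y.example"],
        "engine_2": lambda q: ["y.example", "z.example"],
    }
    print(meta_search("mp3", engines))
```

URLs returned by several engines rise to the top, which is how merging improves coverage without losing agreement between engines.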
Clustering Search Engine
A search engine (even a meta-search engine) can return a huge ranked list of Web pages. However, this method is highly inefficient:
• Search results can be in the millions for a typical query.
• The criteria used for the ranking may not reflect the needs of the user.
• A majority of queries tend to be short, which makes them non-specific or imprecise.
By clustering the search results, users can find the ones they really want efficiently and correctly.
Clustering Search Engine (Cont.)
The rise of clustering search engines:
• Usually built on a meta-search engine.
• Clustering search results provides a better way to help users find information quickly.
• Well-known clustering search engines: Vivisimo, SnakeT, iBoogie, KartOO, Grokker, etc.
Results of Traditional Search Engines
[Figure: a flat result list in which items from different taxonomies are interleaved (Taxonomy 1, 1, 1, 2, 3, 3, 2). Legend: Relevant Info, Irrelevant Info.]
Results of Clustering Search Engines
[Figure: the results grouped by taxonomy (Taxonomy 1, Taxonomy 2, Taxonomy 3), with Summary A, Summary B, and Summary C. Legend: Relevant Info, Irrelevant Info.]
Vivisimo
iBoogie
KartOO
Clustering Search Engine System : CSES
We propose a novel Clustering Search Engine System, called CSES, which combines:
• The information coverage provided by search engines
• The relevance of information offered by directory search systems
We propose a simple but novel algorithm for clustering the web pages. This algorithm is fundamentally different from traditional clustering algorithms, which require a tremendous amount of computation time.
CSES: Framework
[Figure: CSES framework. Components: Meta-search Engine (Google, Yahoo!, MSN) returning web sites; Meta-Directory System (Yahoo! Dir, Google Dir, ODP Dir) providing the directory tree; Taxonomy Information System; Data Grid and Grid Computing. Flow: query, clustering, result.]
CSES: Meta-Directory Tree Based Clustering
[Figure: web sites are compared for similarity (MA1, MA2) against the directory tree (ODP, Yahoo! and Google) and placed under a taxonomy, e.g., Taxonomy: Tax1, Sub tax3 (Ex: Music, Band).]
The similarity computation of MA1 and MA2 is based on term frequency.
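The slides state only that the similarity is based on term frequency, so the following is a loose sketch of directory-based clustering rather than the CSES algorithm itself. Each search result is placed in the directory category whose descriptive terms overlap most with the result's text. The overlap measure, the category term lists, the "Other" bucket, and all names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def tf_similarity(terms_a, terms_b):
    """Term-frequency overlap: for each shared term, add the smaller of its two counts."""
    a, b = Counter(terms_a), Counter(terms_b)
    return sum(min(a[t], b[t]) for t in a.keys() & b.keys())

def cluster_by_directory(results, directory):
    """Assign each (url, text) result to its best-matching directory category."""
    clusters = defaultdict(list)
    for url, text in results:
        terms = text.lower().split()
        scores = {cat: tf_similarity(terms, words) for cat, words in directory.items()}
        best = max(scores, key=scores.get)
        # Results that match no category go to a catch-all cluster.
        clusters[best if scores[best] > 0 else "Other"].append(url)
    return dict(clusters)

if __name__ == "__main__":
    directory = {  # hypothetical category descriptions
        "Music": ["mp3", "music", "band", "song"],
        "Electronics": ["mp3", "player", "device", "battery"],
    }
    results = [
        ("u1", "free mp3 music song download"),
        ("u2", "mp3 player with long battery life"),
        ("u3", "cat food"),
    ]
    print(cluster_by_directory(results, directory))
```

Because each result is compared only against a fixed set of categories, no pairwise comparison between results is needed, which keeps the computation light.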
CSES: User Interface
[Screenshot labels: Tree Structure of Clusters, Search Results, Input Area]
Future Work of CSES
Problems of the cluster search engine:
• Computation load (response time)
• Accuracy (relevance)
• Information display (user interface)
Grid computing, distributed computing
Social network
Apply this framework to other areas