content based search rajesh kumar jain roll no: 07405402 (rkjain@cse.iitb.ac.in)

Post on 30-Dec-2015


Content Based Search

Rajesh Kumar Jain
Roll No: 07405402

(rkjain@cse.iitb.ac.in)

Agenda

• Motivation
• What Do People Want from Search Engine?
• Types of Search Engines
• Existing Search Engines (Google, Yahoo, Ask, AppliedSemantics)
• INIS – International Nuclear Information System
• AgroExplorer
• Our approach – Functional Architecture with example
• Conclusion and Future Work

Motivation

• The Web is a major source of information.
• Need for search engines:
  • Efficient and time saving.
  • Language barrier.
  • Most relevant documents.
• Meaning Based Search: used to retrieve the most relevant documents.
• Multilingual Search: used to eliminate the language barrier.

What Do People Want from Search Engine?

• Integrated Solutions
• Distributed Solutions
• Efficient, Flexible Indexing and Retrieval
• Interfaces and Browsing
• Effective Retrieval
• Multimedia Retrieval
• Information Extraction
• Relevance Feedback

Types of Search Engines

Individual search engines
• Compile their own databases.
• Further classified as:
  • Keyword based search engines: search on the keywords, e.g. Google.
  • Meaning based search engines: search on the meaning or semantics, e.g. AgroExplorer.

Meta search engines
• Do not compile their own databases.
• Search the databases of different search engines, e.g. Dogpile.

Subject directories
• Created and maintained by human editors, e.g. LIBRARIANS' INDEX http://lii.org, INFOMINE http://infomine.ucr.edu, ACADEMIC INFO http://www.academicinfo.us

Existing Search Engines – Google

• Keyword Based Search
• PageRank: relative importance of the web page.
• Anchor Text
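The idea of PageRank as "relative importance of the web page" can be illustrated with a small power-iteration sketch. This is a simplified textbook version, not Google's production algorithm; the four-page toy web, the function name `pagerank` and the damping factor of 0.85 are assumptions made only for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Simplified sketch: every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to C, nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy web, page C collects links from three pages and ends up with the highest rank, while D, which nobody links to, keeps only the baseline share.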

Existing Search Engines – Yahoo!

Yahoo! Search http://search.yahoo.com

• Huge (15 or more billion web pages)
• Relevancy ranking (word proximity and placement), not popularity ranking
• Capitalize OR, AND, or AND NOT. Put parentheses around words joined by OR.
• No search-size word limit (Google limits you to 32 terms)
• Services and tools similar to Google's

Existing Search Engines – Yahoo! (cont)

Differences between searching Google and Yahoo! Search

• Parentheses around ORed terms (sometimes works without parentheses):
  ("global warming" OR "greenhouse effect") rise "sea level" (california OR "los angeles" OR "san diego" OR "san francisco")
• Supports intitle:, site:, inurl: and hostname: (for the entire site name, e.g. hostname:google.com)
• Shortcuts available at http://tools.search.yahoo.com/shortcuts

Existing Search Engines – Ask.com

Ask.com http://ask.com

• Subject-Specific Popularity ranking (links from pages on the same subject as your search)
• Search results analyzed to provide BROADER & NARROWER TERMS suggestions
• Smaller database than Google or Yahoo! (about 2 billion)
• No differences between basic searching in Google and searching Ask.com

Existing Search Engines – AppliedSemantics

•Internet’s first meaning based search engine.

•Used in Google Adsense (Advertising solutions).

•CIRCA technology used (Conceptual Information Retrieval and Communication Architecture).

•CIRCA has

•a scalable, language independent ontology.

•Ontology has

•Millions of words with their meanings

•Conceptual relationships to other meanings.

CIRCA

•Identifies concepts related to specific words and phrases.

•Finds how close “phrase A” is to “concept B”.

•For a given query

•Finds the distance between the query and various concepts in the database.

•E.g. Query – “Colorado Bicycle trips”.

•Possible concepts– region, bicycling, travel, etc.

INIS – International Nuclear Information System

There are three major INIS products:

• The INIS Database, which today contains 2.9 million bibliographic records; it is accessible by subscription only and currently has 1.3 million authorized users.
• A unique collection of over 850 000 full-text documents (non-conventional "grey" literature – NCL) in 63 languages, including many documents that cannot easily be found anywhere else.
• The INIS Multilingual Thesaurus – a major tool for describing nuclear information and knowledge in a structured form, which assists in multilingual and semantic searches.

INIS-Features and Benefits

• IAEA official design
• Direct access to NCL documents in PDF format
• Extended and configurable hyper-linking of external web addresses and emails, facilitating easier access to NCL documents on external systems or contacting authors
• Weekly email notifications
• Improved usability:
  • Allows users to see the query and its results at the same time
  • Allows users to preserve previously run queries for comparison purposes
  • Displays records in reverse chronological order, giving users quick access to the latest records
• Better documentation:
  • Tool-tips assist users in performing tasks
  • Static help pages with "how-to" documents, manuals and glossary of terms can be opened in a separate window for consultation

INIS-Features and Benefits

• Improved configurability:
  • Allows users to fully customize the search mask and search results pages
  • The interface can be used in English, German and Spanish, with Portuguese to be added soon. More languages can be added upon demand
• Anonymous users can register their own profiles and enjoy personalized features
• Improved Index/Authority Navigator with search-composing assistant (CTRL-CLICK)
• Increased data export capabilities: new formats (XML, Excel, formatted text, delimited text, HTML), sorting of exports
• The type-ahead, search-ahead functionality "INIS Suggest" assists users when entering search terms and shows the hit count before the search is executed; this provides additional useful information when composing queries
• Searches are much faster, now enabling queries that used to time out in the old system. Most queries are estimated to be between 5 and 20 times faster

INIS-Features and Benefits

• Support for concurrent users: a round-robin load balancer distributes the load among different databases
• Improved maintenance: all update procedures are automated, require no human intervention and notify administrators in case of problems
• Zero downtime per week: updates are transparent to users, who can use the system 24/7 without performance detriments.
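The round-robin load balancing mentioned above can be pictured with a minimal sketch. It only illustrates the general technique, not the INIS implementation; the class name `RoundRobinBalancer` and the database names `db-1` to `db-3` are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(backends)

    def next_backend(self):
        # Each call returns the next backend and wraps around at the end.
        return next(self._rotation)


# Hypothetical database replicas; seven incoming queries are spread evenly.
balancer = RoundRobinBalancer(["db-1", "db-2", "db-3"])
assigned = [balancer.next_backend() for _ in range(7)]
print(assigned)
```

Round-robin ignores how busy each database currently is, which keeps it simple and works well when queries cost roughly the same.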

AgroExplorer

• A meaning based multilingual search engine.
• Agriculture domain.
• UNL is used as interlingua.
• Supports English, Hindi and Marathi languages.

Methodology

• User phrases the query in native language.
• System translates it to Universal Networking Language (UNL).
• UNL corpus is searched.
• Related documents in UNL are fetched.
• Fetched documents are converted to native language.

AgroExplorer

Query Output

• Complete Expression Matching: retrieves completely relevant documents where the query UNL graph is a subgraph of any sentence UNL graph.
• Partial Expression Matching: retrieves relevant documents where the query UNL graph is a part of any sentence UNL graph.
• Universal Word Matching: search on Universal Words, which are concepts, not just keywords.
• Keyword Based Matching: traditional search. Lucene search engine used.
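The first three matching levels can be sketched by treating a UNL graph as a set of (relation, head, tail) triples. This is a heavily simplified illustration, not AgroExplorer's code: real UNL universal words carry restrictions and attributes that are omitted here, the example sentence and queries are invented, and reading "partial" as "at least one shared relation" is an assumption. The relation labels agt, obj and plc are taken from the UNL relation set.

```python
def universal_words(graph):
    """Collect every universal word that appears in a set of triples."""
    return {word for _, head, tail in graph for word in (head, tail)}


def match_level(query, sentence):
    """Classify how a query graph matches a sentence graph.

    Levels are checked in the order listed on the slide.
    """
    if not query:
        return "none"
    if query <= sentence:  # every query relation occurs in the sentence
        return "complete"
    if query & sentence:  # assumption: at least one shared relation
        return "partial"
    if universal_words(query) & universal_words(sentence):
        return "universal-word"
    return "none"


# Invented sentence: "The farmer grows rice in the field."
sentence = {
    ("agt", "grow", "farmer"),
    ("obj", "grow", "rice"),
    ("plc", "grow", "field"),
}

levels = [
    match_level({("agt", "grow", "farmer"), ("obj", "grow", "rice")}, sentence),
    match_level({("obj", "grow", "rice"), ("plc", "grow", "greenhouse")}, sentence),
    match_level({("obj", "sell", "rice")}, sentence),
    match_level({("obj", "buy", "tractor")}, sentence),
]
print(levels)
```

Because nodes are identified by their universal words, the subgraph test reduces to a subset test on triples in this sketch. A query that reaches the "none" level is where a keyword fallback would take over.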

Multilingual Information Retrieval

Need

• Document collection contains documents in many languages.
• User may not be fluent enough to express the query in the document language.

Approaches

• Machine translation for text translation
• Thesaurus/Dictionary based
• Corpus based (subword clusters)

Our Approach – Functional Architecture

Example…

Commercial Description:

1. Automobile Radio and Stereo Retail Store;
2. Automobile Engine Rebuilding, Repair, and Exchange Workshop;
3. Car Repair and Retail Shop;
4. Jeep Repair and Retail Shop; and
5. Motor Mending and Replacement Workshop.

Example…

For our search, we shall compare these encoding and retrieval techniques:

• a flat list of words,
• a structured list of words,
• a flat list of word senses plus the linguistic ontology,
• a structured list of word senses, using WordNet’s ontology.

Method – Flat list of Words

NO.  QUERY              DESCRIPTIONS FOUND
1    Automobile         1, 2
2    Automobile Retail  1
3    Car Repair         3
4    Motor Repair       -
5    Engine Repair      2
6    Motor Exchange     -

Both recall and precision of this method are very bad!
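The flat-list method can be reproduced with a few lines of code. This is a minimal sketch, assuming that a description matches when it contains every word of the query; the function names `words` and `flat_search` are made up for the illustration, while the descriptions and queries are the ones from the slides.

```python
import re

# The five commercial descriptions from the example.
DESCRIPTIONS = {
    1: "Automobile Radio and Stereo Retail Store",
    2: "Automobile Engine Rebuilding, Repair, and Exchange Workshop",
    3: "Car Repair and Retail Shop",
    4: "Jeep Repair and Retail Shop",
    5: "Motor Mending and Replacement Workshop",
}


def words(text):
    """Lower-case the text and return its words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def flat_search(query):
    """Return the descriptions that contain every word of the query."""
    wanted = words(query)
    return [number for number, text in DESCRIPTIONS.items() if wanted <= words(text)]


QUERIES = [
    "Automobile",
    "Automobile Retail",
    "Car Repair",
    "Motor Repair",
    "Engine Repair",
    "Motor Exchange",
]
results = {query: flat_search(query) for query in QUERIES}
for query, found in results.items():
    print(f"{query}: {found}")
```

The empty results for "Motor Repair" and "Motor Exchange" show the recall problem: description 5 says "Mending" and "Replacement", so plain word matching cannot find it.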

Method – Structured list of Words

NO.  BUSINESS TYPE  ACTIVITY     OBJECT  MARKET AREA
1    Store          Retail       Radio   Automobile
     Store          Retail       Stereo  Automobile
2    Workshop       Rebuilding   Engine  Automobile
     Workshop       Repair       Engine  Automobile
     Workshop       Exchange     Engine  Automobile
3    Shop           Retail       Car
     Shop           Repair       Car
4    Shop           Retail       Jeep
     Shop           Repair       Jeep
5    Workshop       Replacement  Motor
     Workshop       Mending      Motor

Method – Structured list of Words

 

Recall remains the same because we have not eliminated the semantic-match problems.
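A short sketch can make the effect of the structured encoding concrete. The rows below are the structured entries from the slides; the field-based queries and the names `Entry` and `structured_search` are hypothetical, since the slides do not show how structured queries were phrased.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    description: int
    business_type: str
    activity: str
    obj: str
    market_area: Optional[str]


# One entry per row of the structured list.
ENTRIES = [
    Entry(1, "Store", "Retail", "Radio", "Automobile"),
    Entry(1, "Store", "Retail", "Stereo", "Automobile"),
    Entry(2, "Workshop", "Rebuilding", "Engine", "Automobile"),
    Entry(2, "Workshop", "Repair", "Engine", "Automobile"),
    Entry(2, "Workshop", "Exchange", "Engine", "Automobile"),
    Entry(3, "Shop", "Retail", "Car", None),
    Entry(3, "Shop", "Repair", "Car", None),
    Entry(4, "Shop", "Retail", "Jeep", None),
    Entry(4, "Shop", "Repair", "Jeep", None),
    Entry(5, "Workshop", "Replacement", "Motor", None),
    Entry(5, "Workshop", "Mending", "Motor", None),
]


def structured_search(**criteria):
    """Return descriptions with an entry that equals every given field."""
    hits = {
        entry.description
        for entry in ENTRIES
        if all(getattr(entry, field) == value for field, value in criteria.items())
    }
    return sorted(hits)


# "Automobile" as the object no longer hits the radio store (description 1).
print(structured_search(activity="Retail", obj="Automobile"))
# Synonyms are still missed: description 5 says "Mending", not "Repair".
print(structured_search(activity="Repair", obj="Motor"))
print(structured_search(activity="Repair", obj="Car"))
```

Naming the field removes the false hit on description 1, but the "Motor" plus "Repair" query still returns nothing, which is the unchanged recall the slide refers to.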

Method – WordNet Synset and Linguistic ontology

NO.  DISAMBIGUATED DESCRIPTION
1    [car, auto, automobile, machine, motorcar], [radio receiver, receiving set, radio set, radio, tuner, wireless], [stereo, stereo system, stereophonic system], [retail, sell retail], [shop, store]
2    [car, auto, automobile, machine, motorcar], [engine], [rebuilding], [repair, fix, fixing, mending, reparation], [substitution, exchange], [workshop, shop]
3    [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
4    [jeep, landrover], [repair, fix, fixing, mending, reparation], [retail, sell retail], [shop, store]
5    [motor], [repair, fix, fixing, mending, reparation], [replacement, replacing], [workshop, shop]

Method – Flat list of Word senses and Linguistic ontology

NO.  DISAMBIGUATED QUERY    DESCRIPTIONS FOUND
1    [car, auto, automobile, machine, motorcar]    1, 2, 3, 4
2    [car, auto, automobile, machine, motorcar], [retail, sell retail]    1, 3, 4
3    [car, auto, automobile, machine, motorcar], [repair, fix, fixing, mending, reparation]    2, 3, 4
4    [motor], [repair, fix, fixing, mending, reparation]    2, 5
5    [locomotive, engine, locomotive engine, railway locomotive], [repair, fix, fixing, mending, reparation]
6    [motor], [substitution, exchange]    2, 5
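The sense-based lookup can be sketched with a toy ontology. Each synset is reduced to one concept identifier, and a query concept matches a description concept that is either identical or more specific. The three is-a links below are assumptions chosen so that the toy reproduces the results listed on the slide; they are not lookups in the real WordNet, and all names in the code are hypothetical.

```python
# Each description as a set of concept ids (one id per disambiguated synset).
DESCRIPTIONS = {
    1: {"car", "radio", "stereo", "retail", "store"},
    2: {"car", "engine", "rebuilding", "repair", "exchange", "workshop"},
    3: {"car", "repair", "retail", "store"},
    4: {"jeep", "repair", "retail", "store"},
    5: {"motor", "repair", "replacement", "workshop"},
}

# Assumed is-a links (specific concept -> more generic concept).
IS_A = {"jeep": "car", "engine": "motor", "replacement": "exchange"}


def with_ancestors(concept):
    """Return the concept together with everything it is a kind of."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return set(chain)


def sense_search(query_concepts):
    """Return descriptions that cover every query concept, directly or via is-a."""
    hits = []
    for number, concepts in DESCRIPTIONS.items():
        covered = set()
        for concept in concepts:
            covered |= with_ancestors(concept)
        if set(query_concepts) <= covered:
            hits.append(number)
    return hits


QUERIES = {
    1: ["car"],
    2: ["car", "retail"],
    3: ["car", "repair"],
    4: ["motor", "repair"],
    5: ["locomotive", "repair"],  # "engine" disambiguated to the railway sense
    6: ["motor", "exchange"],
}
found = {number: sense_search(concepts) for number, concepts in QUERIES.items()}
for number, hits in found.items():
    print(number, hits)
```

Synonyms collapse into a single concept id, so "mending" and "repair" now match, and the is-a links let the generic query "car" reach the jeep shop. The locomotive sense of "engine" in query 5 matches nothing in this toy.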

Method – Flat list of Word senses and Linguistic ontology

• Decouple the user vocabulary from the data vocabulary, by covering the most common English words;
• Increase recall, by exploiting the hierarchy to make generic queries and recognizing synonyms;
• Increase precision, through the disambiguation mechanism and the ability to navigate the hierarchy to select specific queries.

Conclusion and Future action…

Meaning based search engines can include the concept or idea expressed by the user in his query and can thus provide more accurate results than the traditional keyword search engines.

Universal Networking Language (UNL) can be used as an effective interlingua, to represent information in documents written in natural languages.

Multilingual search engines can help users access documents written in languages other than the query language.

Future Work

The lack of a large scored, multilingual corpus and the adverse effects of polysemous words are found to be the cause of most of the limitations of MLIR systems. Research efforts are being directed towards these fields and towards approaches that use an interlingua such as UNL, subword clusters, etc. effectively for MLIR.

References

• W. Bruce Croft, “What Do People Want from Information Retrieval?”, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst.
• Joe Barker (jbarker@library.berkeley.edu) and John Kupersmith (jkupersm@library.berkeley.edu), “Beyond Google”, A “Know Your Library” Workshop, Teaching Library, University of California, Berkeley, Fall 2006.
• D.W. Oard and B.J. Dorr, A survey of multilingual text retrieval, Institute for Advanced Computer Studies and Computer Science Department, University of Maryland, 1996.
• Mrugank Surve, Sarvjeet Singh, Satish Kagathara, AgroExplorer Group, and Pushpak Bhattacharyya, AgroExplorer: a Meaning Based Multilingual Search Engine, International Conference on Digital Libraries, Delhi, India, February 2004.
• The UNL Center, The Universal Networking Language (UNL) Specifications, UNDL Foundation, 3rd edition, December 2004.
• S. Singh, A Multilingual Meaning Based Search Engine, B.Tech Project Report, Indian Institute of Technology Bombay, 2003.
• U. Hahn, K. Marko, S. Schulz, Subword Clusters as Light Weight Interlingua for Multilingual Document Retrieval, Proceedings of the 10th Machine Translation Summit of the International Association for Machine Translation (MT-Summit X), Phuket, Thailand, 2005.

References (cont)

• K. Marko, U. Hahn, S. Schulz, P. Daumke, and P. Nohama, Interlingual indexing across different languages, In RIAO 2004 – Conference Proceedings, Avignon, France, 26-28 April 2004.
• Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, The PageRank citation ranking: Bringing order to the web, Technical report, Stanford Digital Library Technologies Project, 1998.
• K. Marko, S. Schulz, A. Medelyan and U. Hahn, Bootstrapping Dictionaries for Cross Language Information Retrieval, In SIGIR 2005, Proceedings of the 28th Annual International ACM SIGIR Conference, Salvador, Brazil, August 15-19, 2005.
