Search Strategies Based on Cluster-Based Indexing and Retrieval (sophiasearch)


Upload: devika

Post on 23-Feb-2016


Page 1: Search Strategies based on cluster-based indexing and retrieval sophiasearch

www.sophiasearch.com

Page 2: Our Philosophy of Search

1. A document collection can be viewed as consisting of many hundreds of thousands of documents (typical of a medium-sized enterprise).

2. Subgroups of documents are related to each other by their general themes, which are discovered as clusters (SOPHIA [1, 2]).

3. These themes can be further broken down into individual topics, discovered as sub-clusters.

4. By automatically discovering the themes present in the collection and breaking them down into topics, we can create intuitive groupings of “semantically” similar documents and present these to users.

5. We provide a topical overview of the structure of the collection that enhances browsing.

Page 3: Our Philosophy of Search

6. During search, a theme can be viewed as consisting of one or more related topics. Themes are accessed from the theme panel view of the collection.

7. Each topic contains one or more documents relevant to that topic (and therefore to its theme). Documents are accessed via the topic panel view of the collection.

8. Users browse from theme level to topic level and then choose the documents that are relevant.

9. Users have varying search requirements, and we believe we should provide tools to support them. We therefore offer three main search scenarios: Focused Search, Blanket Search and Query by Example.
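The theme, topic and document hierarchy described in points 2 to 8 could be modelled roughly as follows. This is only an illustrative sketch: the class names, field names and sample data are assumptions rather than the product's actual data model, and it assumes each document belongs to exactly one topic.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    summary: str = ""


@dataclass
class Topic:
    # A topic corresponds to a sub-cluster of a theme.
    description: str
    documents: list[Document] = field(default_factory=list)


@dataclass
class Theme:
    # A theme corresponds to a top-level cluster.
    description: str
    topics: list[Topic] = field(default_factory=list)

    def document_count(self) -> int:
        # Assumes no document is shared between topics.
        return sum(len(topic.documents) for topic in self.topics)


# Hypothetical sample data.
sport = Theme(
    "Sport",
    [
        Topic("Rugby", [Document("Ulster win at Ravenhill")]),
        Topic("Football", [Document("Champions League final"), Document("Transfer news")]),
    ],
)

# Browsing path: theme level -> topic level -> documents.
for topic in sport.topics:
    print(topic.description, [doc.title for doc in topic.documents])
```

Keeping documents only at the topic level mirrors the browsing order described above: a user reaches documents through a topic, never directly from a theme.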

Page 4: General Overview of Search

1. Irrespective of which search scenario you use, type your terms into the query panel.

2. You will be presented with a list of themes relevant to your search terms (Theme panel view).

3. Using the theme descriptions presented, click on the theme you are most interested in. This takes you to the Topic panel.

4. The Topic panel provides an overview of the theme’s topics and a list of the documents belonging to the most relevant topic.

5. A document can be clicked on and read (Document panel view), or a new topic can be clicked to examine the documents it contains.

6. At any time you can go back to the original theme descriptions to browse another theme or try another query.

Page 5: Focused and Blanket Search

The system offers both a specific and a speculative search mechanism.

Focused Search is ideal for finding specific information when you know what you are looking for.

Blanket Search is suited to finding general (diverse) topics related to the search terms. It facilitates exploration and the selection of the most relevant topics for deeper analysis.

Query by Example allows a search based on a given document rather than a keyword query.

Page 6: Blanket Search

• Compare the query with each cluster’s centroid using Jensen-Shannon (JS) divergence, treating the query as a probability distribution over terms.

• Rank clusters in order of increasing divergence.

• Consider an extension that adds a diversity measure to the ranking, so that different themes relevant to the query also appear near the top.
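The first two points can be sketched in a few lines of Python. This is a minimal illustration, not the SOPHIA implementation: the whitespace tokenisation, the representation of a centroid as a term-probability dictionary, and all function names and sample centroids are assumptions.

```python
import math
from collections import Counter


def term_distribution(text: str) -> dict[str, float]:
    """Turn text into a probability distribution over its lower-cased terms."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = no shared terms)."""

    def kl_to_mixture(a: dict[str, float], b: dict[str, float]) -> float:
        # KL(a || m) with m = (a + b) / 2; terms absent from a contribute 0.
        return sum(
            pa * math.log2(pa / (0.5 * pa + 0.5 * b.get(term, 0.0)))
            for term, pa in a.items()
            if pa > 0
        )

    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


def rank_clusters(query: str, centroids: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (cluster name, divergence) pairs in order of increasing divergence."""
    q = term_distribution(query)
    scored = [(name, js_divergence(q, centroid)) for name, centroid in centroids.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical centroids, built here from short sample texts.
centroids = {
    "finance": term_distribution("bank shares market profit"),
    "rugby": term_distribution("rugby ulster ravenhill rugby match"),
}
ranking = rank_clusters("rugby match", centroids)
print(ranking)
```

JS divergence is symmetric and stays finite even when the query and the centroid share no terms, which makes it convenient for comparing a short query with a much larger cluster vocabulary.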

Page 7: Make a Query

Page 8: Theme Panel Overview #1

Page 9: Select the Relevant Theme

1) Read theme names and descriptions

2) Scroll to read them all

3) Select the relevant theme

Page 10: Focused Search

• Use an inverted index of terms/phrases to identify the documents relevant to the query, based on the frequency of the query terms within the documents where they occur.

• Rank clusters by the proportion of relevant documents they contain, highest first.
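A rough sketch of these two steps is given below. The slides do not say exactly how term frequency decides relevance, so the rule used here (every query term must occur at least `min_tf` times) is an assumption, as are the toy collection and all function names.

```python
from collections import Counter, defaultdict

# Hypothetical toy collection: document id -> (cluster name, text).
DOCS = {
    "d1": ("sport", "rugby match report from ravenhill"),
    "d2": ("sport", "rugby world cup preview"),
    "d3": ("sport", "football transfer news"),
    "d4": ("business", "bank sponsors rugby club"),
    "d5": ("business", "bank shares fall"),
    "d6": ("business", "market profit warning"),
}


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, (_cluster, text) in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def relevant_documents(index, query, min_tf=1):
    """Documents in which every query term occurs at least min_tf times (assumed rule)."""
    postings = [
        {doc_id for doc_id, tf in index.get(term, {}).items() if tf >= min_tf}
        for term in query.lower().split()
    ]
    return set.intersection(*postings) if postings else set()


def rank_clusters(docs, relevant):
    """Rank clusters by the proportion of their documents that are relevant."""
    sizes = Counter(cluster for cluster, _text in docs.values())
    hits = Counter(docs[doc_id][0] for doc_id in relevant)
    proportions = {cluster: hits[cluster] / size for cluster, size in sizes.items()}
    return sorted(proportions.items(), key=lambda pair: pair[1], reverse=True)


index = build_inverted_index(DOCS)
ranking = rank_clusters(DOCS, relevant_documents(index, "rugby"))
print(ranking)
```

Ranking by proportion rather than by raw count keeps a large, loosely related cluster from outranking a small cluster in which most documents match.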

Page 11: How to Explore from the Topic Panel

Once you have selected a theme for further analysis, you are presented with a matrix of topics. Each topic has a size, a description and a colour.

The size gives a visual indication of the relevance of the topic to the query terms, enabling the user to focus very quickly on the best topics for document retrieval.

The most relevant topics are presented at the top left-hand side of the matrix.

The description helps you understand the content of the topic.

Page 12: How to Explore from the Topic Panel

Topics of the same colour are closely related in content. We refer to topics of the same colour as presenting similar aspects of the theme to the user. If two topics have different colours, we say they cover slightly different aspects of the theme.

Initially the most relevant topic in a theme is automatically selected. The documents it contains are listed on the right-hand side of the screen. They can be viewed 10 at a time.

Based on the topic descriptions, the user may want to click on other topics within this theme. This action displays the documents of the newly selected topic.

Documents have titles and summaries associated with them (based around the original query terms used for the search). Using this information, a document can be selected and clicked on to display its contents.

Page 13: How to Explore from the Theme Panel

Once you have entered your search terms and have been presented with a list of relevant themes (as in the previous slide), you can use the functionality offered by the Theme panel view to explore further.

Use the Theme descriptions (left-hand side) combined with the Topic descriptions (right-hand side) to determine the most useful Theme for further analysis.

Click on a Theme to get a more detailed overview of the different Topics it contains.

Page 14: Topic Panel #1

Page 15: Topic Panel #2

Document summary, with keywords highlighted.

Selected Topic. Its size indicates its relevance to the query: a bigger Topic is more relevant to the specified query.

Click to display full document.

Colour indicates aspect.

Current document page.

Page 16: Focused Search

A query is entered as with Blanket Search; just make sure the Focused Search radio button is active.

Use the – character to indicate words you want to exclude. E.g. the following query

Rugby –Ulster –Ravenhill

returns the clusters with the highest proportion of documents that contain Rugby but not Ulster or Ravenhill. If you omit the – characters, you will get back documents that contain all 3 terms.
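The exclusion syntax can be illustrated with a small parser. This is a sketch under stated assumptions, not the product's query parser: the function names are hypothetical, terms are matched as whole lower-cased words, and both the plain hyphen and the dash character shown on the slide are accepted as the exclusion marker.

```python
def parse_query(query: str) -> tuple[set[str], set[str]]:
    """Split a query into required terms and excluded terms.

    A leading "-" (or the "–" that slide software often substitutes for it)
    marks a term to exclude; every other term is required.
    """
    required, excluded = set(), set()
    for token in query.lower().split():
        if token.startswith(("-", "–")):
            term = token[1:]
            if term:  # ignore a marker with no term attached
                excluded.add(term)
        else:
            required.add(token)
    return required, excluded


def matches(text: str, required: set[str], excluded: set[str]) -> bool:
    """True if the text contains every required term and no excluded term."""
    terms = set(text.lower().split())
    return required <= terms and not (excluded & terms)


required, excluded = parse_query("Rugby –Ulster –Ravenhill")
print(required, excluded)
print(matches("Rugby World Cup draw", required, excluded))
print(matches("Ulster rugby at Ravenhill", required, excluded))
```

Without the exclusion markers, all three words land in the required set, which matches the behaviour described above: only documents containing all 3 terms are returned.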

Themes are presented using the same interface as before.

Clicking on a theme presents the topics it contains, as before.

Page 17: Make the First Query

1) Select Focused Search

2) Type a query

3) Press the “Search” button

Page 18: Theme Panel View

1) Select the relevant theme

Page 19: Topic Panel View

Page 20: Document Panel View

Page 21: Query by Example

This is a powerful and unique feature of our search engine.

It enables you to present an example document, or a portion thereof, as a query to retrieve topically similar documents.

Firstly, create a text file containing the content you want to use as your exemplar document (use Notepad, under Accessories, to paste the content into, then save to disk).

Click the Query by Example radio button.

Use the Browse button to select the location of the newly created text document.

Then press Search.

Results are presented using the now familiar theme-based approach, where the theme that contains the most documents related to the concepts of the query document is ranked highest.
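The ranking rule in the last point could be sketched as follows. The slides do not state how "related to the concepts of the query document" is measured, so cosine similarity over raw term counts with a fixed threshold is used here purely as a stand-in; the threshold value, the function names and the sample themes are all assumptions.

```python
import math
import tempfile
from collections import Counter
from pathlib import Path


def term_vector(text: str) -> Counter:
    """Raw term counts of the lower-cased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_themes(example_path, themes, threshold=0.2):
    """Rank themes by how many of their documents resemble the example file."""
    query = term_vector(Path(example_path).read_text(encoding="utf-8"))
    counts = {
        name: sum(cosine(query, term_vector(doc)) >= threshold for doc in docs)
        for name, docs in themes.items()
    }
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical themes, each given as a list of document texts.
THEMES = {
    "finance": ["bank shares fall", "market profit warning"],
    "football": ["arsenal lost the champions league final", "lehmann sent off in the final"],
}

# The example document is saved to a text file first, as in the steps above.
with tempfile.TemporaryDirectory() as folder:
    example = Path(folder) / "example.txt"
    example.write_text("lehmann was sent off in the champions league final", encoding="utf-8")
    ranking = rank_themes(example, THEMES)

print(ranking)
```

A whole document gives far more terms than a keyword query, so even this simple term-overlap measure can separate the themes; a production system would more likely compare the document against cluster representations than against every document.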

Page 22: Query by Example

1) Press to locate the query text file on your local disk

2) Directory and name of the query document

3) Click to find topically similar documents

Page 23: Query By Example (1)

He is being hailed this morning as a tragic figure who might just have stepped from a Wagnerian opera. The German papers today expressed their sympathy with Jens Lehmann, whose "moment of madness" in the Champions League final between Arsenal and Barcelona led to him being sent off in the 18th minute, ultimately leading to Arsenal's 2-1 defeat.

The papers all agree that Lehmann deserved to be punished after plucking at the boot of Barcelona's Samuel Eto'o. But there was criticism also in Germany of the Norwegian referee's decision to give Lehmann the red card. "The cleverest decision of referee Terje Hauge would have been to give the advantage and allow the goal for Barcelona - and to have warned Lehmann, the German number one," the Berliner Zeitung wrote this morning. It added that Lehmann's sending off "decimated" his team, a fate that Arsenal had not really "deserved".

Query Document

Page 24: Themes Returned

Page 25: Topics in Highest Ranked Theme

Documents within the best topic that are most conceptually relevant to the query

Topic most relevant to the query

Page 26: References

1. Niall Rooney, David Patterson, Mykola Galushka, Vladimir Dobrynin: A scaleable document clustering approach for large document corpora. Inf. Process. Manage. 42(5): 1163-1175 (2006)

2. Vladimir Dobrynin, David W. Patterson, Mykola Galushka, Niall Rooney: SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection. IEEE Transactions on Information Technology in Biomedicine 9(2): 256-265 (2005)