search strategies based on cluster-based indexing and retrieval sophiasearch

Post on 23-Feb-2016

TRANSCRIPT

  • Search Strategies based on cluster-based indexing and retrieval (www.sophiasearch.com)

  • Our Philosophy of Search

    A document collection can be viewed as consisting of many hundreds of thousands of documents (typical in a medium-sized enterprise).

    Subgroups of documents are related to each other by their general themes, which are discovered as clusters (SOPHIA [1, 2]).

    These themes can be further broken down into individual topics, represented as sub-clusters.

    By automatically discovering themes present in the collection and breaking them down into topics we can create intuitive groupings of semantically similar documents and present these to users.

    We provide a topical overview of the structure of the collection that enhances browsing

  • Our Philosophy of Search

    During search, a theme can be viewed as consisting of one or more related topics. Themes can be accessed from the theme panel view of the collection.

    Each topic contains one or more documents relevant to that topic (and obviously the theme itself). Documents are accessed via the topic panel view of the collection.

    Users browse from theme level, to topic level and then choose documents that are relevant.

    Users have varying search requirements, and we believe we should provide tools to support them. We therefore offer 3 main search scenarios: Focused Search, Blanket Search and Query by Example.

  • General Overview of Search

    Irrespective of which search scenario you use, start by typing terms into the query panel.

    You will be presented with a list of themes relevant to your search terms (Theme panel view)

    Using the theme descriptions presented, click on the one you are most interested in. This takes you to the topic panel.

    The topic panel provides an overview of the theme's topics and a list of the documents belonging to the most relevant topic.

    A document can be clicked on and read (Document panel view) or a new topic clicked to examine the documents it contains

    At any time you can go back to the original theme descriptions to browse another theme or try another query

  • Focused and Blanket Search

    Facility for both a specific and a speculative search mechanism.

    Focused Search is ideal for finding specific information, when you know what you are looking for.

    Blanket Search is suitable for finding general (diverse) topics related to the search terms, to facilitate exploration and the selection of the most relevant ones for deeper analysis.

    Query by Example allows a search based on a given document rather than a keyword query.

  • Blanket Search

    Compare a query with each cluster's centroid by treating the query as a probability distribution over terms and computing the JS-divergence between the two.

    Rank clusters in order of increasing divergence.

    Consider an extension that adds a diversity measure to the ranking, so that different themes relevant to the query also appear near the top.
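
    The ranking step described on this slide can be sketched as follows. This is a minimal illustration, not the SOPHIA implementation: the function names, the toy centroids and the use of raw term counts for the query distribution are all assumptions, and the diversity extension is not covered.

```python
import math
from collections import Counter


def to_distribution(terms):
    """Turn a list of terms into a probability distribution over terms."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two term distributions."""
    divergence = 0.0
    for term in set(p) | set(q):
        p_t, q_t = p.get(term, 0.0), q.get(term, 0.0)
        m_t = 0.5 * (p_t + q_t)  # mixture of the two distributions
        if p_t > 0:
            divergence += 0.5 * p_t * math.log2(p_t / m_t)
        if q_t > 0:
            divergence += 0.5 * q_t * math.log2(q_t / m_t)
    return divergence


def rank_clusters(query_terms, centroids):
    """Return cluster names ordered by increasing divergence from the query."""
    query = to_distribution(query_terms)
    return sorted(centroids, key=lambda name: js_divergence(query, centroids[name]))


# Hypothetical centroids: each is a term distribution summing to 1.
centroids = {
    "finance": {"bank": 0.6, "market": 0.4},
    "sport": {"rugby": 0.5, "ulster": 0.3, "match": 0.2},
}
ranking = rank_clusters(["rugby", "ulster"], centroids)
```

    With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared terms), so the cluster whose centroid shares the most probability mass with the query is ranked first.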

  • Make A Query

  • Theme Panel Overview #1

  • Select the Relevant Theme

    1) Read theme names and descriptions
    2) Scroll to read them all
    3) Select the relevant theme

  • Focused Search

    Use an inverted index of terms/phrases to identify documents that are relevant to the query, based on the frequency of the query terms within the documents where they occur.

    Rank clusters by the proportion of relevant documents they contain, highest first.
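
    A minimal sketch of the two steps above, under stated assumptions: documents are split on whitespace, a document counts as relevant when every query term occurs at least `min_tf` times (a hypothetical threshold), and all function names and sample data are illustrative rather than taken from the product.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def relevant_documents(index, query_terms, min_tf=1):
    """Documents in which every query term occurs at least min_tf times."""
    doc_sets = []
    for term in query_terms:
        postings = index.get(term.lower(), {})
        doc_sets.append({doc for doc, tf in postings.items() if tf >= min_tf})
    return set.intersection(*doc_sets) if doc_sets else set()


def rank_clusters_focused(clusters, relevant):
    """Order cluster names by their proportion of relevant documents."""
    def proportion(name):
        members = clusters[name]
        return len(members & relevant) / len(members) if members else 0.0
    return sorted(clusters, key=proportion, reverse=True)


# Hypothetical collection and cluster membership.
docs = {
    "d1": "rugby ulster ravenhill",
    "d2": "rugby match report",
    "d3": "bank market news",
    "d4": "ulster rugby rugby",
}
clusters = {"finance": {"d3"}, "sport": {"d1", "d2", "d4"}}

index = build_inverted_index(docs)
relevant = relevant_documents(index, ["rugby", "ulster"])
ranking = rank_clusters_focused(clusters, relevant)
```

    Here two of the three sport documents contain both query terms, so that cluster is ranked above the finance cluster, which contains none.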

  • How to Explore from the Topic Panel

    Once you have selected a theme for further analysis, you are presented with a matrix of topics. Each topic has a size, a description and a colour.

    The size gives a visual indication of the topic's relevance to the query terms, so the user can quickly focus on the best topics for document retrieval.

    The most relevant topics are presented at the top left-hand side of the matrix.

    The description helps you understand the content of the topic.
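
    One way to picture the layout rule above is the sketch below. It assumes a row-major fill (left to right, then top to bottom) and a linear mapping from relevance to display size; neither detail is stated on the slides, and the function name, parameters and scores are hypothetical.

```python
def layout_topics(relevance, columns=3, min_size=1.0, max_size=3.0):
    """Place topics in a grid, most relevant first, sized by relevance."""
    ranked = sorted(relevance.items(), key=lambda item: item[1], reverse=True)
    top_score = ranked[0][1] if ranked else 0.0
    cells = []
    for position, (topic, score) in enumerate(ranked):
        # Scale relative to the best topic so it gets the largest cell.
        scale = score / top_score if top_score > 0 else 0.0
        cells.append({
            "topic": topic,
            "row": position // columns,
            "col": position % columns,
            "size": min_size + (max_size - min_size) * scale,
        })
    return cells


# Hypothetical relevance scores for the topics of one theme.
relevance = {
    "transfers": 0.2,
    "champions league": 0.9,
    "injuries": 0.45,
    "referees": 0.6,
}
cells = layout_topics(relevance, columns=2)
```

    The first cell (row 0, column 0) always holds the most relevant topic and receives the maximum size.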

  • How to Explore from the Topic Panel

    Topics of the same colour are closely related in content. We refer to topics of the same colour as presenting similar aspects of the theme to the user. If the colours of 2 topics are different, we say they cover slightly different aspects of the theme.

    Initially the most relevant topic in a theme is automatically selected. The documents it contains are listed on the right hand side of the screen. They can be viewed 10 at a time.

    Based on topic descriptions, the user may want to click on other topics within this theme. This action displays the documents of the newly selected topic.

    Documents have titles and summaries associated with them (based around the original query terms used for search). Using this information a document can be selected and clicked on to display its contents.

  • How to Explore from the Theme Panel

    Once you have entered your search terms and been presented with a list of relevant themes (as in the previous slide), you can use the functionality offered by the Theme panel view to explore further.

    Use the Theme descriptions (LHS) combined with the Topic descriptions (RHS) to determine the most useful Theme for further analysis.

    Click on a Theme to get a more detailed overview of the different Topics it contains.

  • Topic Panel #1

  • Topic Panel #2

    Document summary, with keywords highlighted.

    Selected topic. Its size indicates its relevance to the query: a bigger topic is more relevant to the specified query.

    Click to display the full document.

    Colour indicates aspect.

    Current document page.

  • Focused Search

    A query is entered as with blanket search; just make sure the focused search radio button is active.

    Use the exclusion character to mark words you want to exclude. E.g. the query "Rugby Ulster Ravenhill", with Ulster and Ravenhill marked for exclusion, returns clusters that have the highest proportion of documents that contain rugby but not Ulster or Ravenhill. Without the exclusion character you will get back documents that contain all 3 terms.

    Themes are presented using the same interface as before.

    By clicking on a theme, the topics it contains are presented as before
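
    The effect of excluding terms in the Rugby/Ulster/Ravenhill example can be sketched as below. To avoid guessing the query syntax, the sketch takes the included and excluded terms as two separate lists; the function names and the sample clusters are hypothetical, and matching is by whole lower-cased word.

```python
def matches(text, include, exclude=()):
    """True if the text contains every included term and no excluded term."""
    terms = set(text.lower().split())
    has_all = all(term.lower() in terms for term in include)
    has_excluded = any(term.lower() in terms for term in exclude)
    return has_all and not has_excluded


def rank_clusters(clusters, include, exclude=()):
    """Order cluster names by their proportion of matching documents."""
    def proportion(name):
        docs = clusters[name]
        if not docs:
            return 0.0
        return sum(matches(doc, include, exclude) for doc in docs) / len(docs)
    return sorted(clusters, key=proportion, reverse=True)


# Hypothetical clusters, each a list of document texts.
clusters = {
    "ulster rugby": ["Rugby Ulster Ravenhill report", "Ulster rugby fixtures"],
    "world rugby": ["Rugby world cup", "Rugby six nations", "Ulster weather"],
}

with_exclusion = rank_clusters(
    clusters, include=["rugby"], exclude=["ulster", "ravenhill"]
)
without_exclusion = rank_clusters(clusters, include=["rugby", "ulster", "ravenhill"])
```

    With the exclusions in place the "world rugby" cluster comes first, because two of its three documents mention rugby without Ulster or Ravenhill; when all 3 terms are required instead, the "ulster rugby" cluster comes first.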

  • Make the First Query

    1) Select Focused Search
    2) Type a query
    3) Press the Search button

  • Theme Panel View

    1) Select the relevant theme

  • Topic Panel View

  • Document Panel View

  • Query by Example

    This is a powerful and unique feature of our search engine.

    It enables you to present an example document or portion thereof as a query to retrieve topically similar documents

    First, create a text file containing the content you want to use as your exemplar document (for example, paste the content into Notepad, found under Accessories, then save it to disk).

    Click the query by example radio button

    Use the browse button to select the location of the newly created text document

    Then press search

    Results are presented using the now familiar theme-based approach, in which the themes that contain the most documents related to the concepts of the query document are ranked highest.
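
    The theme ranking described above can be sketched as follows. The slides do not say how "related" is measured, so this sketch assumes cosine similarity over raw term counts with a hypothetical threshold; the function names and sample texts are illustrative only.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_themes_by_example(query_text, themes, threshold=0.2):
    """Rank themes by how many of their documents resemble the query text."""
    query = term_vector(query_text)
    counts = {
        name: sum(1 for doc in docs if cosine(query, term_vector(doc)) >= threshold)
        for name, docs in themes.items()
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical themes, each a list of document texts.
themes = {
    "opera": ["wagnerian opera review"],
    "football": [
        "arsenal lose champions league final",
        "lehmann sent off early",
        "transfer news",
    ],
}
ranking = rank_themes_by_example("lehmann sent off in champions league final", themes)
```

    Two of the football documents share enough terms with the query text to pass the threshold, so that theme is ranked above the opera theme, which has no related documents.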

  • Query by Example

    1) Press to locate the query text file on your local disk
    2) Directory and name of the query document
    3) Click to find topically similar documents

  • Query By Example (1)

    He is being hailed this morning as a tragic figure who might just have stepped from a Wagnerian opera. The German papers today expressed their sympathy with Jens Lehmann, whose "moment of madness" in the Champions League final between Arsenal and Barcelona led to him being sent off in the 18th minute, ultimately leading to Arsenal's 2-1 defeat.

    The papers all agree that Lehmann deserved to be punished after plucking at the boot of Barcelona's Samuel Eto'o. But there was criticism also in Germany of the Norwegian referee's decision to give Lehmann the red card. "The cleverest decision of referee Terje Hauge would have been to give the advantage and allow the goal for Barcelona - and to have warned Lehmann, the German number one," the Berliner Zeitung wrote this morning. It added that Lehmann's sending off "decimated" his team, a fate that Arsenal had not really "deserved".

    Query Document

  • Themes Returned

  • Topics in Highest Ranked Theme

    Most conceptually relevant documents to the query within the best topic.

    Most relevant topic to the query.

  • References

    1. Niall Rooney, David Patterson, Mykola Galushka, Vladimir Dobrynin: A scaleable document clustering approach for large document corpora. Inf. Process. Manage. 42(5): 1163-1175 (2006).

    2. Vladimir Dobrynin, David W. Patterson, Mykola Galushka, Niall Rooney: SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection. IEEE Transactions on Information Technology in Biomedicine 9(2): 256-265 (2005).
