The University of Kansas: Vitalseek (Dr. Susan Gauch)
Post on 21-Dec-2015
The University of Kansas
Overview
• Provide technical and research capabilities for a Kansas City startup company
• Partner with Today Communications, Inc. to provide high-quality online medical information
• Develop innovative, quality-based rankings of online Web pages
• Transition technology so that it is easy to adapt for potential clients and easy to maintain for the sponsoring company
Project Goals
• System to support online entry of human judgments for a wide variety of medical Web sites on a large number of criteria
• Novel search engine combining traditional keyword-based retrieval with user-selected quality criteria
• Speed
• Scalability
• Reliability
Ranking System
• Ranking System
- Online entry, viewing, validation, modification
- Over 150 criteria per site
- Sites rated overall and per topic (50 topics)
Spider
• Automatically collect Web pages from Web sites
– Keys off of sites as they are entered in ratings database
• Continuous loop
– Visit sites
– Index content
– Revisit sites
• Multiple, concurrent spiders on a dedicated machine
– Co-ordination
– Speed
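The visit/index/revisit loop with several concurrent spiders could be sketched roughly as follows. This is only an illustration, not the project's actual spider: `fetch_site`, `run_spiders`, and the `passes` limit are hypothetical names, the "crawl" returns fake content, and a real spider would loop indefinitely rather than stop after a fixed number of passes.

```python
import queue
import threading

def fetch_site(site):
    """Hypothetical stand-in for a real crawl: returns fake page text for a site."""
    return {f"{site}/index.html": f"content of {site}"}

def run_spiders(sites, num_spiders=3, passes=2):
    """Visit every site `passes` times using `num_spiders` concurrent workers."""
    work = queue.Queue()          # shared work queue of (site, visit number)
    index = {}                    # site -> pages collected on the latest visit
    visits = {}                   # site -> number of completed visits
    lock = threading.Lock()       # co-ordination: protects the shared dicts

    for site in sites:
        work.put((site, 1))

    def spider():
        while True:
            item = work.get()
            if item is None:      # sentinel: no more work for this spider
                work.task_done()
                return
            site, visit_no = item
            pages = fetch_site(site)              # visit the site
            with lock:
                index[site] = pages               # (re)index its content
                visits[site] = visit_no
            if visit_no < passes:
                work.put((site, visit_no + 1))    # schedule a revisit
            work.task_done()

    threads = [threading.Thread(target=spider) for _ in range(num_spiders)]
    for t in threads:
        t.start()
    work.join()                   # wait until every scheduled visit is done
    for _ in threads:
        work.put(None)            # tell each spider to stop
    for t in threads:
        t.join()
    return index, visits
```

The shared queue is what co-ordinates the spiders here: no two workers pick up the same visit, and a site is only re-queued after its previous visit has been recorded.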
Indexing Documents
• Initially, all documents are indexed together
– Time Bottleneck (4+ days to index)
– Space Bottleneck (resulting file exceeds system limits)
• Revised version
– Each site indexed separately
• Can visit, index in a loop site by site
– But, must select a subset of the collections to process for each query
• Classic distributed information retrieval problem
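As a rough illustration of the revised scheme, the sketch below builds one small inverted index per site instead of a single index over all documents. The function names, the whitespace tokenizer, and the `{term: {url: count}}` layout are assumptions made for this example; the slides do not describe the actual index format.

```python
from collections import defaultdict

def build_site_index(pages):
    """Build an inverted index for one site.

    pages: {url: text}. Returns {term: {url: term_count}}.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return dict(index)

def index_all_sites(site_pages):
    """Index each site separately: {site: {url: text}} -> {site: site_index}."""
    return {site: build_site_index(pages) for site, pages in site_pages.items()}
```

Because each site has its own index, one site can be re-indexed after a revisit without touching the others. The cost is that a query must first decide which of the per-site indexes to consult.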
Retrieval System - Broker
• Given a query and a set of criteria
• Phase I – Broker
– Select those Web sites that meet the criteria
• E.g., Privacy, Authority, Navigation
– Select those sites that have the best content from among the first set
• Number of documents with the query words
– Send the query to the top N sites (approx. 10)
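The broker phase could look something like the sketch below. It makes several assumptions that the slides do not confirm: site ratings are reduced to pass/fail flags per criterion, each site index has the form `{term: {url: count}}`, and a document counts as a match if it contains any of the query words. `select_sites` and its parameters are hypothetical names.

```python
def select_sites(query_terms, required_criteria, site_ratings, site_indexes, top_n=10):
    """Pick the sites to which the query should be sent.

    site_ratings: {site: {criterion: bool}}  (assumed pass/fail ratings)
    site_indexes: {site: {term: {url: count}}}
    """
    # Step 1: keep only the sites that meet every user-selected criterion.
    eligible = [
        site for site, ratings in site_ratings.items()
        if all(ratings.get(criterion, False) for criterion in required_criteria)
    ]

    # Step 2: estimate content quality as the number of documents on the site
    # that contain at least one query word (an assumption).
    def matching_docs(site):
        index = site_indexes.get(site, {})
        urls = set()
        for term in query_terms:
            urls.update(index.get(term, {}))
        return len(urls)

    scored = [(matching_docs(site), site) for site in eligible]
    scored = [(count, site) for count, site in scored if count > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))   # most matches first

    # Step 3: the query goes to the top N sites only.
    return [site for _, site in scored[:top_n]]
```

For example, with criteria `["Privacy", "Authority"]`, a site that fails Privacy is dropped in step 1 no matter how many matching documents it holds.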
Retrieval System – Query Processing
• For each site,
– Identify the top documents for the query
• Page weight with respect to query terms
• Site weight with respect to user criteria
• Combine these factors and rank the pages
• Fuse results from all sites
– Merge the lists of pages based on weights
– Rearrange as necessary to provide results from a mix of sites on each page
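One possible way to fuse the per-site lists is sketched below. The slides do not say how page and site weights are combined or how the mix of sites is enforced, so this example assumes a simple product of the two weights and a per-site cap on each results page. `fuse_results`, `page_size`, and `max_per_site` are hypothetical.

```python
def fuse_results(per_site_results, site_weights, page_size=10, max_per_site=3):
    """Merge per-site result lists and split them into results pages.

    per_site_results: {site: [(url, page_weight), ...]}
    site_weights:     {site: weight with respect to the user criteria}
    Returns a list of pages, each a list of (score, site, url) tuples.
    """
    # Combine page weight and site weight (a product is assumed here).
    merged = []
    for site, results in per_site_results.items():
        for url, page_weight in results:
            merged.append((page_weight * site_weights.get(site, 1.0), site, url))
    merged.sort(key=lambda r: (-r[0], r[2]))     # best score first

    # Fill each results page in score order, deferring hits from a site that
    # already has `max_per_site` entries on the current page.
    pages = []
    remaining = merged
    while remaining:
        page, deferred, counts = [], [], {}
        for item in remaining:
            site = item[1]
            if len(page) < page_size and counts.get(site, 0) < max_per_site:
                page.append(item)
                counts[site] = counts.get(site, 0) + 1
            else:
                deferred.append(item)
        pages.append(page)
        remaining = deferred
    return pages
```

Deferred hits keep their relative order, so a strong result that is pushed off one page appears near the top of the next.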
Partner System
• Allows Vitalseek to be a back-end search engine
• Results appear as though from partner
• Web-based system for
– Entering partners
– Customizing results
– Customizing search criteria
Challenges
• Combining user criteria and keywords
– Initial versions used a weighted combination
– Abandoned in favor of filtering version
• Scalability
– Thousands of sites
– Millions of pages
• Spidering and indexing speed
• System limits
– Priority-based pruning of index files
• High-tech start-up demands, university research lab schedule
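To make the first challenge concrete, the sketch below contrasts a weighted combination of keyword and quality scores with a filter-then-rank approach. The score fields, `alpha`, and `min_quality` are invented for illustration; the slides do not give the formulas used in either version.

```python
def rank_weighted(docs, alpha=0.5):
    """Rank by a weighted sum of keyword and quality scores (hypothetical formula)."""
    return sorted(
        docs,
        key=lambda d: -(alpha * d["keyword"] + (1 - alpha) * d["quality"]),
    )

def rank_filtered(docs, min_quality=0.5):
    """Drop documents below the quality threshold, then rank by keyword score only."""
    kept = [d for d in docs if d["quality"] >= min_quality]
    return sorted(kept, key=lambda d: -d["keyword"])

docs = [
    {"url": "x", "keyword": 0.2, "quality": 1.0},
    {"url": "y", "keyword": 0.9, "quality": 0.6},
    {"url": "z", "keyword": 1.0, "quality": 0.1},
]
weighted_order = [d["url"] for d in rank_weighted(docs)]
filtered_order = [d["url"] for d in rank_filtered(docs)]
```

In this toy data the weighted version still returns the low-quality page `z`, while the filtering version removes it before ranking. That difference is one plausible reason to prefer filtering, though the slides do not state why the switch was made.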