cs6322: information retrieval sanda harabagiu projects fall 2015 engines
TRANSCRIPT
CS6322: CS6322: Information Retrieval Information Retrieval Sanda HarabagiuSanda Harabagiu
Projects Fall 2015
Engines
Project # 1Search Engine for Movies : 3 students (optional 5)
– Crawl 100 000 pages from imdb.comE.g. Start from:An American woman, well-versed in political campaigns, is sent to the
war-torn lands of South America to help install a new leader but is threatened to be thwarted by a long-term rival.Director:David Gordon Green Writers:Peter Straughan (screenplay), Rachel Boynton (documentary) Stars:Sandra Bullock, Billy Bob Thornton, Anthony Mackie | See full cast and crew »
Project # 1 - continuedSearch Engine for Movies
– Crawl 100 000 pages from imdb.comContinue crawling:
Director:David Gordon Green Writers:Peter Straughan (screenplay), Rachel Boynton (documentary) Stars:Sandra Bullock, Billy Bob Thornton, Anthony Mackie | See full cast and crew »
1 Student responsible for crawling
Project # 1 - continuedSearch Engine for Movies
– Index 50 000 pages from imdb.com & Create the WEB graph + use index and the graph to develop 2 relevance models + Topic-specific Page Ranking + HITS to rank the results
1 Student responsible for Indexing and relevance
Project # 1 - continuedSearch Engine for Movies
– Prepare the Graphical User Interface for:Introducing the queryPresenting the resultsShowcasing the results from Google and Bing for the same query
In 2 additional frames on the same web pageThe results should consist of three frames on the same page:
1. Results of your search engine2. Results from Google3. Results from Bing1 Student responsible for :
User interfaces and Comparisons with GoogleAnd Bing
Project # 1 - continuedSearch Engine for Movies - OPTIONAL
– Cluster the web pages to improve results:Use flat clusteringUse 4 methods of agglomerative clusteringProvide experimental results for 500 queries
with and without clustering
1 Student responsible for :Clustering & experiments
Project # 1 - continuedSearch Engine for Movies - OPTIONAL
– Query expansion through pseudo-relevance feedback:
Implement the Roccio algorithm and test it on 20 queriesUse for automatic query expansion for 50 queries:
– association clusters, – metric clusters– Scalar clusters
– Provide experimental results for 50 queries
1 Student responsible for :Relevance feedback and Query expansion
Web CrawlersWeb Crawlers
How do the web search engines get How do the web search engines get all of the items they index?all of the items they index?
How much stuff is out there?How much stuff is out there?
How do you store millions of words How do you store millions of words from hundreds of sites so that you from hundreds of sites so that you can find them quickly (and can find them quickly (and efficiently)?efficiently)?
Depth-First CrawlingDepth-First Crawling
Page 1
Page 3Page 2
Page 1
Page 2
Page 1
Page 5
Page 6
Page 4Page 1
Page 2
Page 1
Page 3
Site 6
Site 5
Site 3
Site 1 Site 2
Site Page1 11 21 41 61 31 53 15 16 15 22 12 22 3
Breadth FirstBreadth First
Page 1
Page 3Page 2
Page 1
Page 2
Page 1
Page 5
Page 6
Page 4Page 1
Page 2
Page 1
Page 3
Site 6
Site 5
Site 3
Site 1 Site 2
Site Page1 12 11 21 61 32 22 31 43 11 55 15 26 1
Additional Search Engines
33-5 students-5 studentsSearch Engine for Countries/ WikipediaSearch Engine for Commerce/ AmazonSearch Engine for Restaurants