CS6322: Information Retrieval, Sanda Harabagiu, Projects Fall 2015: Engines


Upload: rodney-allen-wilkins

Post on 19-Jan-2016

Page 1: CS6322: Information Retrieval Sanda Harabagiu Projects Fall 2015 Engines

CS6322: Information Retrieval
Sanda Harabagiu

Projects Fall 2015

Engines

Page 2

Project #1: Search Engine for Movies (3 students, optionally 5)

– Crawl 100,000 pages from imdb.com
  E.g., start from:
  "An American woman, well-versed in political campaigns, is sent to the war-torn lands of South America to help install a new leader but is threatened to be thwarted by a long-term rival."
  Director: David Gordon Green
  Writers: Peter Straughan (screenplay), Rachel Boynton (documentary)
  Stars: Sandra Bullock, Billy Bob Thornton, Anthony Mackie

Page 3

Project #1 (continued): Search Engine for Movies

– Crawl 100,000 pages from imdb.com
  Continue crawling:
  Director: David Gordon Green
  Writers: Peter Straughan (screenplay), Rachel Boynton (documentary)
  Stars: Sandra Bullock, Billy Bob Thornton, Anthony Mackie

1 Student responsible for crawling
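A crawler restricted to one site needs to turn each extracted link into an absolute URL and decide whether it stays in scope. The sketch below shows one way to do that with the standard library. The function names `normalize` and `in_scope` and the example paths are hypothetical, not part of the assignment.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(base_url, href):
    """Resolve href against the page it was found on and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


def in_scope(url, domain="imdb.com"):
    """Keep only URLs whose host is the target domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


# Hypothetical page and link, for illustration only.
link = normalize("http://www.imdb.com/title/tt0000001/", "/name/nm0000001/#bio")
print(link, in_scope(link))
```

A real crawler would also need to honor robots.txt and rate-limit its requests; both are left out of this sketch.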

Page 4

Project #1 (continued): Search Engine for Movies

– Index 50,000 pages from imdb.com
– Create the web graph
– Use the index and the graph to develop 2 relevance models
– Use topic-specific PageRank and HITS to rank the results

1 Student responsible for Indexing and relevance
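The two link-based rankers named on this slide can be sketched on a toy web graph as follows. This is a minimal illustration under stated assumptions: the graph is a dict from page to outgoing links, the damping factor of 0.85 and the fixed iteration count are conventional choices rather than course requirements, and the page names are hypothetical.

```python
def all_pages(links):
    """Every page that appears as a source or a target in the graph."""
    return sorted(set(links) | {t for targets in links.values() for t in targets})


def topic_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """Power iteration in which teleportation lands only on topic_pages."""
    pages = all_pages(links)
    teleport = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: hand its mass back to the topic set.
                for p in pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank


def normalized(scores):
    """Scale a score dict to unit Euclidean length."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) scores from alternating HITS updates."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        auth = normalized(auth)
        hub = normalized({p: sum(auth[t] for t in links.get(p, [])) for p in pages})
    return hub, auth


# Hypothetical four-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = topic_pagerank(graph, topic_pages={"a"})
hubs, authorities = hits(graph)
```

In practice HITS is usually run on a query-specific subgraph (the top results plus their neighbors), whereas PageRank is computed once over the whole crawl.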

Page 5

Project #1 (continued): Search Engine for Movies

– Prepare the graphical user interface for:
  – Entering the query
  – Presenting the results
  – Showcasing the results from Google and Bing for the same query, in 2 additional frames on the same web page

The results should consist of three frames on the same page:

1. Results of your search engine
2. Results from Google
3. Results from Bing

1 student responsible for user interfaces and comparisons with Google and Bing

Page 6

Project #1 (continued): Search Engine for Movies (OPTIONAL)

– Cluster the web pages to improve results:
  – Use flat clustering
  – Use 4 methods of agglomerative clustering
  – Provide experimental results for 500 queries with and without clustering

1 student responsible for clustering and experiments
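The slide does not name the four agglomerative methods. As an assumption, the sketch below uses the four linkage criteria commonly covered in IR courses: single-link, complete-link, average-link, and centroid. It is a naive implementation over small point tuples with Euclidean distance; real document vectors would more likely use cosine similarity.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


# Each linkage maps two clusters (lists of points) to a distance.
LINKAGES = {
    "single": lambda a, b: min(dist(p, q) for p in a for q in b),
    "complete": lambda a, b: max(dist(p, q) for p in a for q in b),
    "average": lambda a, b: sum(dist(p, q) for p in a for q in b) / (len(a) * len(b)),
    "centroid": lambda a, b: dist(centroid(a), centroid(b)),
}


def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters repeatedly until only k remain."""
    cluster_distance = LINKAGES[linkage]
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical 2-D points standing in for document vectors.
sample = [(0, 0), (0, 1), (10, 10), (10, 11)]
for name in LINKAGES:
    print(name, agglomerative(sample, k=2, linkage=name))
```

This version recomputes every pairwise cluster distance at each merge, so it only suits small experiments; a crawl of tens of thousands of pages would need a priority-queue or library implementation.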

Page 7

Project #1 (continued): Search Engine for Movies (OPTIONAL)

– Query expansion through pseudo-relevance feedback:
  – Implement the Rocchio algorithm and test it on 20 queries
  – Use for automatic query expansion for 50 queries:
    – association clusters
    – metric clusters
    – scalar clusters
– Provide experimental results for 50 queries

1 student responsible for relevance feedback and query expansion
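The Rocchio update moves the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones: q' = alpha*q + beta*mean(relevant) - gamma*mean(non-relevant). With pseudo-relevance feedback, the top-ranked results are simply assumed relevant. The sketch below also includes a normalized association score between terms, computed from co-occurrence frequencies. Vectors are sparse dicts; the weights alpha=1.0, beta=0.75, gamma=0.15 are commonly quoted defaults, not values from the slides, and all terms and numbers are made up.

```python
from collections import Counter


def rocchio(query, relevant, nonrelevant=(), alpha=1.0, beta=0.75, gamma=0.15):
    """Return the reweighted query; terms with non-positive weight are dropped."""
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant)
    return {term: weight for term, weight in new_query.items() if weight > 0}


def association_neighbors(docs, term):
    """Rank other terms by normalized association with `term`.

    c(u, v) is the sum over documents of f(u, d) * f(v, d), and the
    normalized score is c(u, v) / (c(u, u) + c(v, v) - c(u, v)).
    """
    def corr(u, v):
        return sum(d.get(u, 0) * d.get(v, 0) for d in docs)

    scores = {}
    for other in {t for d in docs for t in d} - {term}:
        c = corr(term, other)
        if c:
            scores[other] = c / (corr(term, term) + corr(other, other) - c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical query and top-ranked documents (assumed relevant).
query = {"campaign": 1.0}
top_docs = [{"campaign": 1.0, "election": 2.0}, {"campaign": 1.0, "rival": 2.0}]
expanded = rocchio(query, top_docs)
```

Metric and scalar clusters follow the same pattern with a different correlation: metric clusters weight co-occurrences by the distance between term positions, and scalar clusters compare the terms' association vectors.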

Page 8

Web Crawlers

How do the web search engines get all of the items they index?

How much stuff is out there?

How do you store millions of words from hundreds of sites so that you can find them quickly (and efficiently)?

Page 9

Depth-First Crawling

[Figure: link graph of pages on Sites 1, 2, 3, 5 and 6]

Crawl order:

Site  Page
1     1
1     2
1     4
1     6
1     3
1     5
3     1
5     1
6     1
5     2
2     1
2     2
2     3

Page 10

Breadth-First Crawling

[Figure: link graph of pages on Sites 1, 2, 3, 5 and 6]

Crawl order:

Site  Page
1     1
2     1
1     2
1     6
1     3
2     2
2     3
1     4
3     1
1     5
5     1
5     2
6     1
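The difference between the two crawl orders comes down to how the frontier of discovered URLs is consumed: a FIFO queue gives breadth-first order, a LIFO stack gives depth-first order. The sketch below runs both on a small in-memory graph. The link structure of the slide's figure is not recoverable from the transcript, so the graph and page names here are hypothetical.

```python
from collections import deque


def crawl_order(graph, seed, strategy="bfs"):
    """Visit pages reachable from seed; 'bfs' uses a queue, 'dfs' a stack."""
    frontier = deque([seed])
    seen = {seed}  # pages are marked when discovered, so none is queued twice
    order = []
    while frontier:
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(page)
        links = graph.get(page, [])
        if strategy == "dfs":
            # Push in reverse so the first link on the page is followed first.
            links = reversed(links)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical link graph: page -> links found on that page.
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
print(crawl_order(toy_graph, "A", "bfs"))  # ['A', 'B', 'C', 'D', 'E']
print(crawl_order(toy_graph, "A", "dfs"))  # ['A', 'B', 'D', 'C', 'E']
```

Breadth-first order tends to reach many sites early, while depth-first order can sink deep into one site before touching the next, which matters when the crawl is capped at a fixed number of pages.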

Page 11

Additional Search Engines (3-5 students)

– Search Engine for Countries / Wikipedia
– Search Engine for Commerce / Amazon
– Search Engine for Restaurants