Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Upload: maude-lang

Post on 17-Jan-2016


Page 1: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Computer Science 1000

Information Searching II

Permission to redistribute these slides is strictly prohibited without permission

Page 2: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Search Engine

a collection of computer programs designed to help us find information on the Web
  typically served through a website
different search providers exist, but basic functionality is consistent
  type keywords into a text box
  page returns links to other pages

Page 3: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Search Engine

why is a search engine like an index?
  recall that an index maps keywords to a location in some medium (like a page number in a book)
  a search engine does a very similar thing
    takes keywords of interest from a user
    maps these keywords to relevant web pages
  in fact, one of the key components of a search engine is its index

Page 4: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Search Engine

what differentiates a search engine from other indexes (like a book index)?
  the ability to quickly combine keywords in searches
    e.g. search for information on ducks and foxes
  result ranking
  personalization
  among others …

Page 5: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Search Engine – How it Works

different search engines employ different technologies
  the full details of commercial search engines are typically not public
however, some of the basics are consistent
  crawling
  indexing
  query processing

Page 6: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Crawling

for a search engine to be able to link to a web page, it must know about its existence
search engines find pages by crawling the web
  programs called crawlers or spiders
    e.g. Googlebot
  a crawler visits web pages, in much the same way that you do
  as each page is visited, information is remembered about the page (indexing)

Page 7: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Crawling – To-do List

the to-do list is a list of pages that are to be visited by the crawler
the crawling process starts with an initial to-do list, populated with sites from previous crawls
however, the list is updated as the crawl takes place
  hyperlinks on visited sites are added to the list

To-do List:
  http://www.uleth.ca
  http://www.tsn.ca
  http://www.usask.ca
  ...

Page 8: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Crawling – Example

suppose that this page was being processed by a crawler

Kev's Page

Favorite Stuff:

• New York Islanders

• Saskatchewan Roughriders

• John Deere

as a consequence of this page being crawled, its links would be added to the todo list (if they aren't already there)

those pages would subsequently be checked by the crawler at some point
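The to-do list process described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any real crawler is built: the "web" is a hypothetical in-memory dictionary, the `.example` addresses are made up, and `crawl` and `fetch_links` are names chosen for this sketch.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "kev.example/home": ["islanders.example", "riders.example", "deere.example"],
    "islanders.example": ["kev.example/home"],
    "riders.example": [],
    "deere.example": ["riders.example"],
}


def crawl(seed_urls, fetch_links):
    """Visit pages in to-do list order, adding newly found links to the list."""
    todo = deque(seed_urls)   # the to-do list
    seen = set(seed_urls)     # everything that has ever been on the to-do list
    visited = []
    while todo:
        url = todo.popleft()
        visited.append(url)   # a real crawler would index the page here
        for link in fetch_links(url):
            if link not in seen:   # only add links that aren't already there
                seen.add(link)
                todo.append(link)
    return visited


order = crawl(["kev.example/home"], lambda url: FAKE_WEB.get(url, []))
print(order)
```

Starting from the single seed page, the three linked pages are added to the to-do list once each and visited afterwards, even though two of them are linked from more than one place.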

Page 9: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The "Invisible Web"

not all information is crawled, which means it is not visible to search engines
  some pages are new, and haven't yet had a chance to be crawled
  however, there are other reasons that certain information does not get crawled

Page 10: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The "Invisible Web"

1) No hyperlinks to that page
  recall that in order for a page to be crawled, it must either:
    be on the to-do list, or
    be linked to from a page that appears on the to-do list
  without a hyperlink, that page will never be found

[Figure: a to-do list containing Page 1, Page 2 and Page 3, shown beside a set of web pages (Page 1 to Page 6) and the links between them]

Page 5 will not be crawled, as it is not on the to-do list, and no other pages link to it.

Page 11: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The "Invisible Web"

2) The page is synthetic
  a synthetic page is created on demand, depending on user input
    e.g. the results of a search on another search engine

My personal search for "New York Islanders" on Bing results in an on-demand page that is not stored. Hence, it will not be crawled.

Page 12: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The "Invisible Web"

3) The content is unreadable to the crawler
  search engines are primarily text-based
  certain data, such as movie content, is not crawlable

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=72746

The webpage containing the movie might be crawled, but not the movie itself.

Page 13: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The "Invisible Web"

4) The content is password-protected

if you require a password to access a page, then so does a search engine*

Page 14: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The "Invisible Web"

5) You ask the search engine to ignore your site
  the presence of certain files stored with your website asks crawlers not to crawl your site
    e.g. the Robots Exclusion Protocol
      a file called robots.txt can be stored that will request that your site (or just certain pages) not be crawled
  unlike the previous four examples, this does not prevent search engines from crawling your site
    they can choose to ignore robots.txt

http://www.robotstxt.org/

Example:

User-agent: Google
Disallow:

User-agent: *
Disallow: /
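To see how a well-behaved crawler might interpret the example rules, here is a small sketch using Python's standard `urllib.robotparser` module. The rules are copied from the slide; the page address and the crawler name `SomeOtherBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the slide: "Google" may fetch everything,
# every other crawler is asked to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page address, used only for this illustration.
page = "http://www.example.com/page.html"
google_allowed = parser.can_fetch("Google", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)
print(google_allowed, other_allowed)
```

The parser only reports what the file requests. Nothing here stops a crawler that chooses not to check.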

Page 15: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Indexing

the primary role of the crawler is to build an index
an index is a list of tokens
  words
  phrases (not considered here)*
each token is associated with a list of URLs
  in other words, like a book index, but with page URLs instead of page numbers
  other information might be stored with URLs (e.g. page location of token)
these indexes are saved by the search provider
  search queries use information from the indexes (fast), rather than crawling the web for each query (slow)

*http://www.google.com/patents/US7536408
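A token-to-URL index of this kind can be sketched as a dictionary of sets. This is a toy version under simple assumptions: the page addresses and text are made up, tokens are single lowercase words, and `build_index` is a name chosen for this sketch.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}


def build_index(pages):
    """Map each token (word) to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            token = word.strip(".,!?;:\"'()")   # drop surrounding punctuation
            if token and token not in STOP_WORDS:
                index[token].add(url)
    return index


# Hypothetical pages, standing in for what a crawler has visited.
pages = {
    "pets.example/cats": "The cat is a popular pet.",
    "pets.example/dogs": "A dog and a cat can share a home.",
}
index = build_index(pages)
print(sorted(index["cat"]))
```

Looking up `index["cat"]` is a single dictionary access, which is why answering a query from the index is fast compared with visiting pages again.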

Page 16: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Index Lists – Example

* from text – Figure number might be different

Page 17: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Indexing – What Makes a Token?

page text
  a common approach
  search providers differ on which text is selected*
    some may use all text
    others may only use certain text, such as:
      titles and headings
      frequently occurring words
      words occurring early in a page
    sometimes, stop words (a, an, the) are ignored
hyperlink text
  the term from a hyperlink on another page may be used to describe the page that it links to

*http://computer.howstuffworks.com/internet/basics/search-engine1.htm

Page 18: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Query Processing

the part of the search engine that we see
the query processor:
  reads words/phrases from the user interface
  returns pages that are relevant to that query
modern query processors:
  are extremely fast
  are very accurate
  allow a considerable variety in their capabilities

how does this all work?

Page 19: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Query Processing – How it works

let's start simple: suppose we search for a single word (e.g. cat)
in a nutshell:
  the search engine finds the list for the token 'cat'
    contains list of pages that contain 'cat' in the appropriate text (e.g. title)
  this list is ranked according to perceived relevance
  the ranked list is returned as an ordered set of hyperlinks

Page 20: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Query Processing – How it works

Step 1: the search engine finds the list for the token 'cat'

Page 21: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Query Processing – How it works

Step 2: this list is ranked according to perceived relevance

www.cat.com
en.wikipedia.org/wiki/Cat
www.youtube.com/watch?v=J---aiyznGQ
...

Page 22: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Query Processing – How it works

Step 3: the ranked list is returned as an ordered set of hyperlinks

www.cat.com
en.wikipedia.org/wiki/Cat
www.youtube.com/watch?v=J---aiyznGQ
...
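The three steps can be put together in a short sketch. The relevance scores below are invented purely so that the output matches the order shown on the slide; how a real search engine scores pages is not public, and `search` is a name chosen for this sketch.

```python
# Step 1 input: the index list for the token 'cat' (URLs taken from the slide).
index = {
    "cat": [
        "en.wikipedia.org/wiki/Cat",
        "www.cat.com",
        "www.youtube.com/watch?v=J---aiyznGQ",
    ],
}

# Made-up relevance scores, for illustration only.
relevance = {
    "www.cat.com": 0.9,
    "en.wikipedia.org/wiki/Cat": 0.8,
    "www.youtube.com/watch?v=J---aiyznGQ": 0.5,
}


def search(token, index, relevance):
    """Find the token's list, rank it, and return the ordered URLs."""
    urls = index.get(token, [])                       # step 1: find the list
    return sorted(urls, key=lambda url: relevance.get(url, 0.0),
                  reverse=True)                       # steps 2 and 3: rank, return


results = search("cat", index, relevance)
print(results)
```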

Page 23: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Query Processing

what about multi-word searching?
  as mentioned, some search engines index phrases as well
  however, what if a particular phrase is not indexed?
    e.g. (text) red fish guppy
solution: intersecting queries
  the webpages that are common to all of the search words are returned

Page 24: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Intersecting Queries

example (text): suppose the query was “red fish guppy”
  further suppose that the indexes for each word were as follows:

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
guppy: en.wikipedia.org/wiki/guppy, www.ifga.org, www.fullredguppy.com, www.sciencedaily.com, www.tropicalfish.com

result is the set of sites that contain all of the keywords
  in other words, the sites that are found on all three lists

Result: www.fullredguppy.com, www.sciencedaily.com
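The same result can be checked with a short sketch that uses Python's built-in set intersection. This only confirms what the answer is. It is not the list-walking method a search engine would need at scale, and `intersect_query` is a name chosen for this sketch.

```python
# Index lists copied from the example above.
index = {
    "red": ["en.wikipedia.org/wiki/red", "newsroom.urc.edu",
            "www.fullredguppy.com", "www.red.com", "www.sciencedaily.com"],
    "fish": ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
             "www.fullredguppy.com", "www.sciencedaily.com"],
    "guppy": ["en.wikipedia.org/wiki/guppy", "www.ifga.org",
              "www.fullredguppy.com", "www.sciencedaily.com",
              "www.tropicalfish.com"],
}


def intersect_query(query, index):
    """Return the URLs that appear on the index list of every query word."""
    url_sets = [set(index.get(word, [])) for word in query.split()]
    if not url_sets:
        return []
    return sorted(set.intersection(*url_sets))


result = intersect_query("red fish guppy", index)
print(result)
```

Note that newsroom.urc.edu is on the red and fish lists but not on the guppy list, so it is left out.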

Page 25: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Intersecting Queries - Efficiency

the size of index lists can be large
  'cat' returns over 2.3 billion results
modern search engines are fast
hence, clever algorithms must be developed for optimizing queries
  example: intersecting queries

Page 26: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Intersecting Queries - Efficiency

suppose you had two search terms
  e.g. red and fish
the query processor has a list for each token
  suppose each list contained 1 billion URLs
let's consider a method for performing the intersecting query
  that is, how do we find all pages that occur on both lists?

Page 27: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Naive Approach

for each entry in the 'red' list
  search through the entire 'fish' list
  if we find the entry from the red list, then add that to our result

red: www.sciencedaily.com, en.wikipedia.org/wiki/red, newsroom.urc.edu, www.red.com, www.fullredguppy.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result:

Page 28: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Naive Approach

First search: www.sciencedaily.com
do we find it in second list?
  yes – add it to result

red: www.sciencedaily.com, en.wikipedia.org/wiki/red, newsroom.urc.edu, www.red.com, www.fullredguppy.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: www.sciencedaily.com

Page 29: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Naive Approach

Second search: en.wikipedia.org/wiki/red
do we find it in second list?
  no

red: www.sciencedaily.com, en.wikipedia.org/wiki/red, newsroom.urc.edu, www.red.com, www.fullredguppy.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: www.sciencedaily.com

Page 30: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Naive Approach

Third search: newsroom.urc.edu
do we find it in second list?
  yes – add it to result

red: www.sciencedaily.com, en.wikipedia.org/wiki/red, newsroom.urc.edu, www.red.com, www.fullredguppy.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: www.sciencedaily.com, newsroom.urc.edu

Page 31: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Naive Approach

Fourth search: www.red.com
do we find it in second list?
  no

red: www.sciencedaily.com, en.wikipedia.org/wiki/red, newsroom.urc.edu, www.red.com, www.fullredguppy.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: www.sciencedaily.com, newsroom.urc.edu

Page 32: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Naive Approach

Fifth search: www.fullredguppy.com
do we find it in second list?
  yes – add it to result

red: www.sciencedaily.com, en.wikipedia.org/wiki/red, newsroom.urc.edu, www.red.com, www.fullredguppy.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: www.sciencedaily.com, newsroom.urc.edu, www.fullredguppy.com

Page 33: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Naive Approach

problems?
  slow!!
    for each URL in left list, we potentially had to compare it to every URL in right list
    under our previous assumption (billion size lists), we have to do 1 billion x 1 billion comparisons
    even for a powerful computer, this would require a considerable amount of time
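The naive approach can be sketched as two nested loops, with a counter to show how the comparisons add up. The lists are the ones used in the walk-through; `naive_intersect` is a name chosen for this sketch, and it stops scanning the second list as soon as a match is found, which is a small shortcut over searching the entire list.

```python
def naive_intersect(left, right):
    """For each URL in left, scan right from the start looking for a match."""
    result = []
    comparisons = 0
    for left_url in left:
        for right_url in right:
            comparisons += 1
            if left_url == right_url:
                result.append(left_url)
                break   # found it, no need to look further in right
    return result, comparisons


red = ["www.sciencedaily.com", "en.wikipedia.org/wiki/red", "newsroom.urc.edu",
       "www.red.com", "www.fullredguppy.com"]
fish = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]

result, comparisons = naive_intersect(red, fish)
print(result, comparisons)
```

Even with the early exit, every URL that is missing from the second list costs a full scan, so in the worst case the number of comparisons is still the product of the two list lengths.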

Page 34: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Alphabetized Lists

suppose that each list was maintained alphabetically
then we could employ the following approach
  place a marker at start of each list
  if markers point to same URL:
    add URL to result list
    move both markers down
  otherwise, move the marker whose URL is lexicographically smaller
  stop when at least one marker goes off the end of the list
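The marker procedure above translates almost line for line into code. In this sketch the markers are list positions `i` and `j`, the lists are the red and fish lists in alphabetical order, and `sorted_intersect` is a name chosen for illustration.

```python
def sorted_intersect(left, right):
    """Intersect two alphabetically sorted lists using one marker per list."""
    i = j = 0            # place a marker at the start of each list
    result = []
    steps = 0
    while i < len(left) and j < len(right):   # stop when a marker goes off the end
        steps += 1
        if left[i] == right[j]:    # markers point to the same URL
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:   # left URL is lexicographically smaller
            i += 1
        else:                      # right URL is lexicographically smaller
            j += 1
    return result, steps


red = ["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
       "www.red.com", "www.sciencedaily.com"]
fish = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]

result, steps = sorted_intersect(red, fish)
print(result, steps)
```

The method relies on both lists already being sorted. If they were not, a smaller URL skipped by a marker could still appear later in the other list.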

Page 35: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

place markers at the start of each list

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result:

Page 36: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

do markers point to same URL?
  no
  since right marker's URL is less than left marker's URL, move right marker down

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result:

Page 37: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

do markers point to same URL?
  no
  since left marker's URL is less than right marker's URL, move left marker down

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result:

Page 38: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

do markers point to same URL?
  yes
  add URL to result
  move both markers

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: newsroom.urc.edu

Page 39: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

do markers point to same URL?
  no
  since right marker's URL is less than left marker's URL, move right marker down

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: newsroom.urc.edu

Page 40: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

do markers point to same URL?
  yes
  add URL to result
  move both markers

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: newsroom.urc.edu, www.fullredguppy.com

Page 41: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

do markers point to same URL?
  no
  since left marker's URL is less than right marker's URL, move left marker down

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: newsroom.urc.edu, www.fullredguppy.com

Page 42: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

do markers point to same URL?
  yes
  add URL to result
  move both markers

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: newsroom.urc.edu, www.fullredguppy.com, www.sciencedaily.com

Page 43: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

at least one marker has completed its list, so we can stop
notice that our result contains correct values

red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result: newsroom.urc.edu, www.fullredguppy.com, www.sciencedaily.com

Page 44: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach

how many comparisons are done?
  note that every step involves moving at least one marker
  hence, the maximum number of steps is 2 billion
  this is considerably less than (1 billion) squared
  result: a massive speedup
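To get a feel for the size of the gap, the two worst-case counts can be worked out directly. The speed of one billion comparisons per second is an assumption made only for this illustration, not a measured figure for any real machine.

```python
LIST_SIZE = 1_000_000_000                        # 1 billion URLs per list
ASSUMED_COMPARISONS_PER_SECOND = 1_000_000_000   # assumption for illustration
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

naive_comparisons = LIST_SIZE * LIST_SIZE   # every URL against every URL
sorted_steps = 2 * LIST_SIZE                # each step moves at least one marker

naive_years = naive_comparisons / ASSUMED_COMPARISONS_PER_SECOND / SECONDS_PER_YEAR
sorted_seconds = sorted_steps / ASSUMED_COMPARISONS_PER_SECOND

print(f"naive:  {naive_comparisons:,} comparisons, about {naive_years:.1f} years")
print(f"sorted: {sorted_steps:,} steps, about {sorted_seconds:.0f} seconds")
```

Under that assumed speed, the naive approach would take decades, while the sorted approach would finish in a couple of seconds.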

Page 45: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

The Sorted Approach – Notes

remember: commercial search engines don't fully publicize strategies
  hence, some search engines may use alternate approaches for efficient intersections
the previous strategy applies to more than two lists simultaneously
  hence, we can search for multiple tokens, rather than just two
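One possible way to extend the marker idea to several lists is sketched below: keep one marker per list, and whenever the markers disagree, advance every marker that points at a URL smaller than the largest one currently marked. This is just one reasonable generalization, not a description of what any search engine does, and `intersect_all` is a name chosen for this sketch.

```python
def intersect_all(sorted_lists):
    """Intersect any number of alphabetically sorted lists, one marker per list."""
    if not sorted_lists:
        return []
    markers = [0] * len(sorted_lists)
    result = []
    # Stop as soon as any marker goes off the end of its list.
    while all(m < len(lst) for m, lst in zip(markers, sorted_lists)):
        current = [lst[m] for m, lst in zip(markers, sorted_lists)]
        largest = max(current)
        if all(url == largest for url in current):
            result.append(largest)             # every marker agrees
            markers = [m + 1 for m in markers]
        else:
            # A URL smaller than the largest marked URL cannot be on every list.
            markers = [m + 1 if url < largest else m
                       for m, url in zip(markers, current)]
    return result


red = sorted(["en.wikipedia.org/wiki/red", "newsroom.urc.edu",
              "www.fullredguppy.com", "www.red.com", "www.sciencedaily.com"])
fish = sorted(["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
               "www.fullredguppy.com", "www.sciencedaily.com"])
guppy = sorted(["en.wikipedia.org/wiki/guppy", "www.ifga.org",
                "www.fullredguppy.com", "www.sciencedaily.com",
                "www.tropicalfish.com"])

common = intersect_all([red, fish, guppy])
print(common)
```

The lists are sorted before use because the method only works on alphabetized input.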

Page 46: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Example (from text):

Page 47: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Ranking Results

a typical search can produce millions of results
however, we often find what we are looking for in the first few results
  according to Optify, first returned result from Google gets clicked 36.4% of time
  first page gets clicked through 90% of the time
how does this occur?
  via a page ranking system

http://searchenginewatch.com/article/2049695/Top-Google-Result-Gets-36.4-of-Clicks-Study

Page 48: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

Ranking Results

search providers have different ways of ranking the results of the search
Google: PageRank
  proprietary (not all details available)
  some details are public (considered next)
  the higher the PageRank score, the closer to the top of the search results a page will be

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=70897

Page 49: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

PageRank

a scoring system
links from other pages add to a page's score

[Figure: web pages. Page 1 links to Page 4 and Page 5; Page 2 links to Page 5 and Page 6; Page 3 links to Page 5 and Page 6]

the link from Page 1 adds to Page 4's score

the links from Pages 1,2,3 add to Page 5's score

the links from Page 2 and 3 add to Page 6's score
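As a first approximation, the idea that links add to a page's score can be illustrated by simply counting incoming links. This treats every link as equal, which is a deliberate simplification and not how PageRank itself weighs links. The link structure is the one described on this slide.

```python
from collections import Counter

# Outgoing links for each page, as described on the slide.
links = {
    "Page 1": ["Page 4", "Page 5"],
    "Page 2": ["Page 5", "Page 6"],
    "Page 3": ["Page 5", "Page 6"],
}

# Count how many links point at each page.
inbound = Counter(target for targets in links.values() for target in targets)
print(inbound.most_common())
```

Page 5 comes out on top with three incoming links, followed by Page 6 with two and Page 4 with one.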

Page 50: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

PageRank

the score from each page is not weighted equally
  the higher a page's PageRank, the more important its contribution is

[Figure: web pages. Page 1 (low rank) links to Page 3; Page 2 (high rank) links to Page 4]

suppose that Page 3 has one incoming link (from Page 1), and Page 4 has one incoming link (from Page 2)
since Page 2's rank is higher than Page 1's, then Page 4's rank will be higher than Page 3's
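The weighting idea can be sketched with the commonly published textbook form of PageRank, in which each page repeatedly shares its current score among the pages it links to. This is not Google's production system, whose details are not public. The damping value 0.85, the iteration count, and the two extra pages that give Page 2 its higher rank are all assumptions made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook-style PageRank: pages pass their score along their links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Score held by pages with no outgoing links is shared evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1 - damping) / n + damping * dangling / n
                    for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Page 1 -> Page 3 and Page 2 -> Page 4, as on the slide.
# "Extra A" and "Extra B" are hypothetical pages that link to Page 2,
# added only so that Page 2 ends up with a higher rank than Page 1.
links = {
    "Page 1": ["Page 3"],
    "Page 2": ["Page 4"],
    "Extra A": ["Page 2"],
    "Extra B": ["Page 2"],
}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page 3 and Page 4 each have exactly one incoming link, yet Page 4 scores higher because its link comes from the higher-ranked Page 2.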

Page 51: Computer Science 1000 Information Searching II Permission to redistribute these slides is strictly prohibited without permission

PageRank – Notes

since a page is not necessarily aware of other pages that point to it, its PageRank must be computed by the crawler
PageRank is only part of the ranking process that you see
  Google uses over 200 factors to determine page relevancy
    PageRank is one of those factors
    others include location, language, personalization, etc.

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=70897