1 Web Basics Slides adapted from –Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan –CS345A, Winter 2009: Data Mining. Stanford University, Anand Rajaraman, Jeffrey D. Ullman

Upload: roy-watkins

Post on 16-Dec-2015


Web Basics

Slides adapted from –Information Retrieval and Web Search, Stanford University,

Christopher Manning and Prabhakar Raghavan

–CS345A, Winter 2009: Data Mining. Stanford University, Anand Rajaraman, Jeffrey D. Ullman


Web search

• The Web is so large that finding the right page is like finding a needle in a haystack.
• Solutions
– Classification
– Early search engines
– Modern search engines
– …


Early solutions to web search

• Classification of web pages
– Yahoo
– Mostly done by humans; difficult to scale
• Early keyword-based engines, ca. 1995-1997
– AltaVista, Excite, Infoseek, Inktomi, Lycos
– Decide how queries match pages
– Most queries match a large number of pages
– Which page is more authoritative?
• Paid search ranking: Goto.com (aka Overture.com, acquired by Yahoo)
– Your search ranking depended on how much you paid
– Auction for keywords: "casino" was expensive!


Ranking of web pages

• 1998+: Link-based ranking pioneered by Google
– Blew away all early engines save Inktomi
– Great user experience in search of a business model
– Meanwhile, Goto/Overture's annual revenues were nearing $1 billion


Web search overall picture

[Figure: web search architecture. Labels: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User; arrows labeled "links" and "queries". Inset: a results page for the query "miele" (about 7,310,000 results) showing organic results and sponsored links.]

Sec. 19.4.1


Key components in web search
• Links and graph: The web is a hyperlinked document collection, a graph.
• Queries: Web queries are different, more varied, and there are a lot of them. How many?
– 10^8 every day, approaching 10^9
• Users: Users are different, more varied, and there are a lot of them. How many?
– 10^9
• Documents: Documents are different, more varied, and there are a lot of them. How many?
– 10^11. Indexed: 10^10
• Context: Context is more important on the web than in many other IR applications.
• Ads and spam


Web as graph

• Web graph
– Node: web page
– Edge: hyperlink


Why web graph

• Example of a large, dynamic and distributed graph

• Possibly similar to other complex graphs in social, biological and other systems

• Reflects how humans organize information (relevance, ranking) and their societies

• Efficient navigation algorithms

• Study behavior of users as they traverse the web graph (e-commerce)


In-degree and out-degree

• In-degree: number of in-coming edges of a node

• Out-degree: number of out-going edges of a node

• E.g.,
– Node 8 has in-degree 3 and out-degree 0
– Node 2 has in-degree 2 and out-degree 4

• Degree distribution


Degree distribution

• The degree distribution p(i) is the fraction of the nodes that have degree i, i.e.

$p(i) = \frac{\text{number of vertices having degree } i}{\text{total number of vertices}}$

• The degree distribution of the Web graph obeys a power law:

$p(i) \propto i^{-\alpha}$

• A study at the University of Notre Dame reported
– α = 2.45 for the out-degree distribution
– α = 2.1 for the in-degree distribution
• Random graphs have a Poisson degree distribution
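As a minimal sketch of the definition of p(i), the Python snippet below counts how many nodes have each in-degree and out-degree and divides by the number of nodes. The edge list mirrors the adjacency matrix G from the deck's MATLAB graph example; the names `EDGES`, `NODES` and `degree_distribution` are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

# Edges (source, target) of a small directed graph; nodes are numbered 1..10.
EDGES = [
    (1, 2), (1, 3),
    (2, 3), (2, 4), (2, 8), (2, 9),
    (3, 5), (3, 6),
    (4, 8), (5, 8), (6, 10),
    (10, 2), (10, 7), (10, 9),
]
NODES = range(1, 11)


def degree_distribution(edges, nodes, direction="in"):
    """Return {degree i: fraction of nodes that have degree i}."""
    index = 1 if direction == "in" else 0
    # Degree of each node; nodes without edges in this direction count as 0.
    degree = Counter(edge[index] for edge in edges)
    counts = Counter(degree[node] for node in nodes)
    total = len(nodes)
    return {i: counts[i] / total for i in sorted(counts)}


in_dist = degree_distribution(EDGES, NODES, "in")
out_dist = degree_distribution(EDGES, NODES, "out")
print("in-degree distribution: ", in_dist)
print("out-degree distribution:", out_dist)
```

A graph with ten nodes is far too small to show a power law; the point is only how p(i) is computed from the degrees.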


Graph example, MATLAB (or Octave)

% Adjacency matrix: G(i,j) = 1 if node i links to node j
G=[ 0 1 1 0 0 0 0 0 0 0;
0 0 1 1 0 0 0 1 1 0;
0 0 0 0 1 1 0 0 0 0;
0 0 0 0 0 0 0 1 0 0;
0 0 0 0 0 0 0 1 0 0;
0 0 0 0 0 0 0 0 0 1;
0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0;
0 0 0 0 0 0 0 0 0 0;
0 1 0 0 0 0 1 0 1 0
];
indegree=sum(G)    % column sums: links pointing to each node
outdegree=sum(G')  % row sums: links leaving each node
bin=0:4;
h=hist(indegree,bin);  % number of nodes with in-degree 0..4
subplot(1,2,1);
bar(bin,h);
title('indegree');
h=hist(outdegree,bin); % number of nodes with out-degree 0..4
subplot(1,2,2);
bar(bin,h);
title('outdegree');


Power law plotted

• 500 random numbers are generated following a power law with xmin=1, alpha=2
• Subplots (C) and (D) are produced using equal bin size (bin size=5)
• To remove the noise in the tail of subplot (D), we need to use logarithmic bin sizes
• Subplot (F) shows a straight line, as desired
• Try the MATLAB program to experiment with the power law


Generate random numbers

• Generate uniform random numbers
– rand(n,1)
• Generate power law random numbers using the transformation method

n=500;
alpha=2;
xmin=1;
% generate n random numbers following a power law
rawData = xmin*(1-rand(n,1)).^(-1/(alpha-1));
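The transformation method works by inverting the cumulative distribution. For a continuous power law with exponent α and lower bound xmin, the CDF is F(x) = 1 - (x/xmin)^(-(α-1)); setting F(x) = u for a uniform u in [0, 1) and solving for x gives x = xmin·(1-u)^(-1/(α-1)), which is the expression in the MATLAB line. A minimal Python sketch of the same transformation follows; the function names are illustrative assumptions.

```python
import random


def power_law_from_uniform(u, alpha=2.0, xmin=1.0):
    """Map a uniform number u in [0, 1) to a power-law distributed value >= xmin."""
    return xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))


def power_law_samples(n, alpha=2.0, xmin=1.0, seed=None):
    """Draw n power-law samples by transforming uniform random numbers."""
    rng = random.Random(seed)
    return [power_law_from_uniform(rng.random(), alpha, xmin) for _ in range(n)]


samples = power_law_samples(500, alpha=2.0, xmin=1.0, seed=0)
print("smallest:", min(samples), "largest:", max(samples))
```

With alpha=2 and xmin=1, the median of the distribution is 2 (u = 0.5 maps to x = 2), while the largest of 500 samples is typically in the hundreds. That heavy tail is what the later plots visualize.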


Plot the power law data

subplot(3,2,1);
scatter(1:n, rawData);
title('(A) Scatter plot of 500 random data');
subplot(3,2,2);
scatter(1:n, rawData, rawData.^(0.5), rawData); % marker size and color follow the value
title('(B) Crowded dots are plotted in smaller size');
b=5;          % equal bin width
bins=1:b:n;
h=hist(rawData, bins);
subplot(3,2,3);
plot(bins, h, 'o'); % plot against the bin centers so the x-axis shows values
xlabel('value');
ylabel('frequency');
title('(C) Histogram of equal bin size');


Loglog plot

subplot(3,2,4);
loglog(bins, h, 'o'); % function names are case-sensitive: loglog, not Loglog
xlabel('value');
ylabel('frequency');
title('(D) Log-log plot of (C)');
% logarithmic bins: each bin is twice as wide as the previous one
binslog(1)=1;
for j=1:7
b2(j)=2^j;
binslog(j+1)=binslog(j)+b2(j);
end;
subplot(3,2,5);
h=hist(rawData, binslog);
plot(binslog, h, 'o');
xlabel('value');
ylabel('frequency');
title('(E) Histogram of log bin size');
subplot(3,2,6);
plot(log10(binslog), log10(h), 'o'); % reuses h from (E)
xlabel('log10(value)');
ylabel('log10(frequency)');
title('(F) log-log plot of (E)');
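Logarithmic binning can be sketched in a few lines of Python. Note one assumption that goes beyond the MATLAB code above: this sketch also divides each count by its bin width (and by the sample size), a common normalization that turns counts into a density so that wide bins in the tail are not over-weighted. The helper names `log_bin_edges` and `log_binned_density` are hypothetical.

```python
def log_bin_edges(xmin, xmax, factor=2.0):
    """Bin edges xmin, xmin*factor, xmin*factor^2, ... until xmax is covered."""
    edges = [xmin]
    while edges[-1] <= xmax:
        edges.append(edges[-1] * factor)
    return edges


def log_binned_density(data, edges):
    """Return (counts, densities), where density = count / (n * bin width)."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for k in range(len(counts)):
            if edges[k] <= x < edges[k + 1]:
                counts[k] += 1
                break
    n = len(data)
    densities = [c / (n * (edges[k + 1] - edges[k])) for k, c in enumerate(counts)]
    return counts, densities


data = [1, 1.5, 2, 3, 5, 9]
edges = log_bin_edges(min(data), max(data))
counts, densities = log_binned_density(data, edges)
for k, density in enumerate(densities):
    print(f"[{edges[k]}, {edges[k + 1]}): count={counts[k]}, density={density:.4f}")
```

Plotting the densities against the bin positions on log-log axes gives the straight-line picture with far less scatter in the tail than equal-width bins.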


Power law of web graph in 1999

• Note that the in-degree and out-degree distributions are slightly different
• Out-degree may be better fitted by a Mandelbrot law
• What about the current web?
– ClueWeb data consist of 4 billion web pages.


Scale-free networks

• A network is scale-free if its degree distribution follows a power law
– Mathematical model behind it: preferential attachment
• Many networks obey a power law
– Internet at the router and inter-domain level
– Citation network/co-author network
– Collaboration network of actors
– Networks formed by interacting genes and proteins
– …
– Web graph
– Online social network
– Semantic web
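Preferential attachment means that a new node links to existing nodes with probability proportional to their current degree ("the rich get richer"). The sketch below is one simple way to simulate it, assuming an undirected graph, a complete seed graph on m+1 nodes, and m new edges per arriving node; these choices and all names are illustrative assumptions rather than the deck's own model.

```python
import random


def preferential_attachment(n, m=2, seed=None):
    """Grow an undirected graph to n nodes; each new node adds m edges."""
    rng = random.Random(seed)
    # Seed graph: a complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list selects a node with probability proportional to degree.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


edges = preferential_attachment(200, m=2, seed=1)
deg = degrees(edges)
print("max degree:", max(deg.values()), "min degree:", min(deg.values()))
```

In runs like this, a few early nodes usually end up with a degree far above the minimum of m, which is the qualitative signature of a heavy-tailed degree distribution.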


Other graph properties

– Distance from A to B: the length of the shortest path connecting A to B
– Distance from node 0 to node 9: 1
– Length: the average of the distances between all pairs of nodes
– Diameter: the maximum of the distances
– Strongly connected: for any ordered pair of nodes, there is a directed path from the first to the second
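These definitions can be made concrete with breadth-first search. The sketch below reuses the ten-node graph from the deck's MATLAB example (nodes numbered 1..10, so the numbering may differ from the figure on this slide). Because that graph is not strongly connected, the average and maximum are taken only over ordered pairs joined by a directed path, which is an assumption of this sketch.

```python
from collections import deque

# Adjacency lists of the example graph: node -> nodes it links to.
GRAPH = {
    1: [2, 3], 2: [3, 4, 8, 9], 3: [5, 6], 4: [8], 5: [8],
    6: [10], 7: [], 8: [], 9: [], 10: [2, 7, 9],
}


def distances_from(graph, source):
    """Breadth-first search: shortest directed distance to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def path_statistics(graph):
    """Average distance, maximum distance, and whether the graph is strongly connected."""
    lengths = []
    for source in graph:
        for target, d in distances_from(graph, source).items():
            if target != source:
                lengths.append(d)
    all_pairs = len(graph) * (len(graph) - 1)
    return sum(lengths) / len(lengths), max(lengths), len(lengths) == all_pairs


average_length, diameter, strongly_connected = path_statistics(GRAPH)
print(average_length, diameter, strongly_connected)
```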


Small world

• It is a 'small world'
– Millions of people, yet separated by "six degrees" of acquaintance relationships
– Popularized by Milgram's famous experiment (1967)
• Mathematically
– The diameter of the graph is small compared to the overall size N
– Length is proportional to ln(N) for a fixed average degree
– The diameter of a complete graph never grows (always 1)
– This property also holds in random graphs


Bow tie structure of Web

• Study of 200 million nodes & 1.5 billion links
– SCC: strongly connected component in the center
– Upstream (IN): lots of pages that link to other pages, but don't get linked to
– Downstream (OUT): lots of pages that get linked to, but don't link
– Tendrils, tubes, islands
• Small-world property not applicable to the entire web
– Some parts unreachable
– Others have long paths
• Power-law connectivity holds though
– Page in-degree (alpha = 2.1)
– Page out-degree (alpha = 2.72)
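A rough way to see the bow-tie regions in a small graph is to pick one node assumed to lie in the core and compute what it can reach (forward) and what can reach it (backward): the intersection is the SCC, the rest of the backward set is IN, and the rest of the forward set is OUT. The sketch below applies this to the ten-node example graph from the MATLAB slide; the function names are hypothetical, and tendrils, tubes and islands are simply lumped together as "OTHER".

```python
def reachable(graph, start):
    """All nodes reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reverse(graph):
    """Graph with every edge direction flipped."""
    rev = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            rev[target].append(node)
    return rev


def bow_tie(graph, core_node):
    """Split nodes into SCC / IN / OUT / OTHER relative to core_node's component."""
    forward = reachable(graph, core_node)
    backward = reachable(reverse(graph), core_node)
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": set(graph) - forward - backward,
    }


GRAPH = {
    1: [2, 3], 2: [3, 4, 8, 9], 3: [5, 6], 4: [8], 5: [8],
    6: [10], 7: [], 8: [], 9: [], 10: [2, 7, 9],
}
parts = bow_tie(GRAPH, core_node=2)
print(parts)
```

Here the cycle 2 → 3 → 6 → 10 → 2 forms the core, node 1 can reach the core but cannot be reached from it, and the remaining nodes are only reachable from the core.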


Empirical numbers for bow-tie

• Maximal diameter
– 28 for SCC, 500 for entire graph
• Probability of a path between any 2 nodes
– ~1 quarter (0.24)
• Average length
– 16 (directed path exists), 7 (undirected)
• Shortest directed path between 2 nodes in SCC: 16-20 links on average


Component properties

• Each component is roughly the same size
– ~50 million nodes
• Tendrils not connected to SCC
– But reachable from IN and can reach OUT
• Tubes: directed paths IN->Tendrils->OUT
• Disconnected components
– Maximal and average diameter is infinite


Statistics of web graph

• Distribution of incoming and outgoing connections

• Diameter of the graph: Average and maximal length of the shortest path between any two vertices

• Web site and distribution of pages per site
– Consider in project: concepts/classes distribution per file/site in the semantic web?
• Size of the web graph
– Consider in project: What is the size of the semantic web?


Web site size

• Simple estimates suggest billions of nodes
• The distribution of site sizes, measured by the number of pages, follows a power law
– Note that the degree distribution also follows a power law
• Observed over several orders of magnitude, with an exponent α in the 1.6-1.9 range


Web Size

• The web keeps growing.

• But growth is no longer exponential?

• Who cares?
– Media, and consequently the user
– Engine design
– Engine crawl policy. Impact on recall.
• What is size?
– Number of web servers/web sites?

– Number of pages?

– Terabytes of data available?

– Size of search engine index?


Difficulties in defining the web size

• Some servers are seldom connected.
– Example: your laptop running a web server
– Is it part of the web?
• The "dynamic" web is infinite.
– Soft 404: www.yahoo.com/<anything> is a valid page
– Dynamic content, e.g.,
  – Weather forecast
  – Calendar
  – Any sum of two numbers is its own dynamic page on Google. Example: "2+4"
• Deep web content
– E.g., all the articles in nytimes.
• Duplicates
– Static web contains syntactic duplication, mostly due to mirroring (~30%)

Sec. 19.5


What can we attempt to measure?

• The relative sizes of search engines
– The notion of a page being indexed is still reasonably well defined.
– Already there are problems
  – Document extension: e.g., engines index pages not yet crawled, by indexing anchor text.
  – Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.)


“Search engine index contains N pages”: Issues

• Can I claim a page is in the index if I only index the first 4000 bytes?

– Usually long documents are not fully indexed. Bottom parts are ignored.

• Can I claim a page is in the index if I only index anchor text pointing to the page?

– E.g., the Apple web site may not contain the keyword 'computer', but many anchor texts pointing to Apple contain 'computer'.

– Hence, when people search for 'computer', the Apple page may be returned.

• There used to be (and still are?) billions of pages that are only indexed by anchor text.


Indexable web

• The statically indexable web is whatever search engines index.

• Different engines have different preferences
– Max URL depth, max count/host, anti-spam rules, priority rules, etc.
• Different engines index different things under the same URL:
– Frames (e.g., some frames are navigational and should be indexed in a different way)
– Meta-keywords, e.g., put more weight on the title
– Document restrictions, document extensions, ...


Relative size from overlap of engines A and B

• Sample URLs randomly from A
• Check if contained in B, and vice versa
• Each test involves: (i) sampling, (ii) checking

A ∩ B = (1/2) * Size A
A ∩ B = (1/6) * Size B
(1/2) * Size A = (1/6) * Size B
∴ Size A / Size B = (1/6)/(1/2) = 1/3
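The arithmetic above can be sketched as a small estimator: if a fraction x of A's sampled URLs is found in B and a fraction y of B's sampled URLs is found in A, then x·Size A ≈ y·Size B, so Size A / Size B ≈ y / x. In the Python sketch below the two "indexes" are toy sets of made-up URL strings, and the whole index is used as the sample so the result is exact; both are assumptions for illustration.

```python
from fractions import Fraction


def overlap_fraction(sample_urls, other_index):
    """Fraction of the sampled URLs that are also present in the other index."""
    hits = sum(1 for url in sample_urls if url in other_index)
    return Fraction(hits, len(sample_urls))


def relative_size(frac_a_in_b, frac_b_in_a):
    """Estimate Size A / Size B from the two overlap fractions."""
    # |A ∩ B| = frac_a_in_b * |A| = frac_b_in_a * |B|
    return frac_b_in_a / frac_a_in_b


# Toy indexes: A holds 30 URLs, B holds 90 URLs, and 15 URLs are shared.
index_a = {f"url{i}" for i in range(0, 30)}
index_b = {f"url{i}" for i in range(15, 105)}

frac_a_in_b = overlap_fraction(sorted(index_a), index_b)
frac_b_in_a = overlap_fraction(sorted(index_b), index_a)
ratio = relative_size(frac_a_in_b, frac_b_in_a)
print(frac_a_in_b, frac_b_in_a, ratio)
```

In practice only a random sample of each index is available, so the two fractions, and hence the ratio, are estimates with sampling error.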


Sampling URLs

• Ideal strategy: Generate a random URL and check for containment in each index.
– Problem: Random URLs are hard to find!
• Approach 1: Generate a random URL contained in a given engine
– Suffices for the estimation of relative size
• Approach 2: Random walks / IP addresses
– In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)


Random URLs from random queries

• Generate random query: how?
– Lexicon: 400,000+ words from a web crawl (not an English dictionary)
– Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi
• Get 100 result URLs from engine A
• Choose a random URL as the candidate to check for presence in engine B
– Download the document D. Get its list of words.
– Use 8 low-frequency words as an AND query to B
– Check if D is present in the result set.
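The checking step can be sketched against a toy in-memory index. Everything below is a stand-in: `index_b` maps made-up URLs to their word sets, `doc_freq` plays the role of the crawl-derived lexicon with word frequencies, and the function names are hypothetical. A real experiment would send the conjunctive query to the engine's public search interface instead.

```python
from collections import Counter


def rare_word_query(doc_words, doc_freq, k=8):
    """Pick the k distinct words of the document that are rarest in the lexicon."""
    unique = sorted(set(doc_words))
    # Words missing from the lexicon count as frequency 0, i.e. rarest.
    return sorted(unique, key=lambda word: doc_freq.get(word, 0))[:k]


def conjunctive_search(index, query):
    """URLs whose word set contains every query word (an AND query)."""
    return {url for url, words in index.items() if all(w in words for w in query)}


def is_indexed(url, doc_words, index, doc_freq, k=8):
    """Check whether the document at `url` shows up for its own rare-word query."""
    query = rare_word_query(doc_words, doc_freq, k)
    return url in conjunctive_search(index, query)


# Toy engine B: URL -> set of indexed words.
index_b = {
    "b.example/1": {"miele", "vacuum", "cleaner"},
    "b.example/2": {"web", "graph", "power", "law"},
}
doc_freq = Counter(word for words in index_b.values() for word in words)

found = is_indexed("b.example/2", ["web", "graph", "power", "law", "web"], index_b, doc_freq)
missing = is_indexed("a.example/9", ["web", "graph", "bowtie"], index_b, doc_freq)
print(found, missing)
```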


Biases induced by random query

• Query bias: Large documents have a higher probability of being captured by queries
– Solution: reject some large documents using, e.g., the rejection sampling method
• Ranking bias: The search engine ranks the matched documents and returns only the top-k documents.
– Solution: Use conjunctive queries & fetch all
– Another solution: modify the estimator
• Checking bias: Duplicates, impoverished pages omitted
• Document or query restriction bias:
– The engine might not deal properly with an 8-word conjunctive query
• Malicious bias:
– Sabotage by engine
• Operational problems:
– Time-outs, failures, engine inconsistencies, index modification.


Random IP addresses

• Generate random IP addresses

• Find a web server at the given address
– If there's one
• Collect all pages from server
– From this, choose a page at random


Random IP addresses

• Ignored: empty or authorization required or excluded

• [Lawr99] Estimated from observing 2500 servers
– 2.8 million IP addresses running crawlable web servers
– 16 million total servers
– 800 million pages
– Also estimated use of metadata descriptors: meta tags (keywords, description) in 34% of home pages, Dublin Core metadata in 0.3%

• OCLC using IP sampling found 8.7 M hosts in 2001

• Netcraft [Netc02] accessed 37.2 million hosts in July 2002


Advantages & disadvantages

• Advantages
– Clean statistics
– Independent of crawling strategies
• Disadvantages
– Doesn't deal with duplication
– Many hosts might share one IP, or not accept requests
– No guarantee all pages are linked to root page, e.g., employee pages
– Power law for # pages/host generates bias towards sites with few pages
  – But bias can be accurately quantified IF underlying distribution understood
– Potentially influenced by spamming (multiple IPs for same server to avoid IP block)


Random walks

• View the Web as a directed graph

• Build a random walk on this graph
– Includes various "jump" rules back to visited sites
  – Does not get stuck in spider traps!
  – Can follow all links!
– Converges to a stationary distribution
  – Must assume graph is finite and independent of the walk.
  – Conditions are not satisfied (cookie crumbs, flooding)
  – Time to convergence not really known (may be too long)
– Sample from stationary distribution of walk
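A toy version of such a walk is sketched below. The jump rule used here (with probability `jump_prob`, or at a dead end, jump to a uniformly chosen already-visited page) is only one hypothetical choice among the "various jump rules" mentioned above, and the graph is the ten-node example from the MATLAB slide.

```python
import random
from collections import Counter


def random_walk_visits(graph, steps, jump_prob=0.15, seed=None):
    """Count page visits of a random walk that can jump back to visited pages."""
    rng = random.Random(seed)
    current = rng.choice(sorted(graph))
    visited = [current]          # distinct pages seen so far, in order of discovery
    seen = {current}
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        links = graph[current]
        if not links or rng.random() < jump_prob:
            current = rng.choice(visited)   # jump back to a visited page
        else:
            current = rng.choice(links)     # follow a random out-link
        if current not in seen:
            seen.add(current)
            visited.append(current)
    return visits


GRAPH = {
    1: [2, 3], 2: [3, 4, 8, 9], 3: [5, 6], 4: [8], 5: [8],
    6: [10], 7: [], 8: [], 9: [], 10: [2, 7, 9],
}
visits = random_walk_visits(GRAPH, steps=10_000, seed=42)
for page, count in sorted(visits.items()):
    print(page, count / 10_000)
```

The visit frequencies are not uniform: pages with many in-links are visited more often. Using the walk to sample pages therefore requires correcting for this non-uniform stationary distribution.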


Advantages & disadvantages

• Advantages
– "Statistically clean" method, at least in theory!
– Could work even for infinite web (assuming convergence) under certain metrics.
• Disadvantages
– List of seeds is a problem.
– Practical approximation might not be valid.
– Non-uniform distribution
  – Subject to link spamming


Conclusions

• No sampling solution is perfect.

• Lots of new ideas ...

• ... but the problem is getting harder

• Quantitative studies are fascinating and a good research problem


Another estimation method

• OR-query of frequent words in a number of languages

• According to such a query:
– Size of web > 21,450,000,000 on 2007.07.07
– Size of web > 25,350,000,000 on 2008.07.03
• But page counts of Google search results are only rough estimates.


The Web document collection

• No design/co-ordination
• Distributed content creation, linking, democratization of publishing
• Content includes truth, lies, obsolete information, contradictions …
• Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases) …
• Scale much larger than previous text collections … but corporate records are catching up
• Growth – slowed down from initial "volume doubling every few months" but still expanding
• Content can be dynamically generated
– See the next slide


Documents

• Dynamically generated content (deep web)
– Dynamic pages are generated from scratch when the user requests them, usually from underlying data in a database.
– Example: current status of flight LH 454
– Most (truly) dynamic content is ignored by web spiders.
– It's too much to index it all.
– Actually, a lot of "static" content is also assembled on the fly (ASP, PHP, etc.: headers, date, ads, etc.)


Web search overall picture

[Figure: web search architecture. Labels: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User; arrows labeled "links" and "queries". Inset: a results page for the query "miele" (about 7,310,000 results) showing organic results and sponsored links.]

Sec. 19.4.1


Users

• Use short queries (fewer than 3 words on average)

• Rarely use operators

• Don’t want to spend a lot of time on composing a query

• Only look at the first couple of results

• Want a simple UI, not a search engine start page overloaded with graphics

• Extreme variability in terms of user needs, user expectations, experience, knowledge, . . .

– Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class

• One interface for hugely divergent needs


Queries

• Queries have a power law distribution – Power law again !

• a few very frequent queries, a large number of very rare queries

• Examples of rare queries: search for names, towns, books etc


Types of queries

• Informational user needs: I need information on something. (~40% / 65%)
– "web service", "information retrieval"
• Navigational user needs: I want to go to this web site. (~25% / 15%)
– "hotmail", "myspace", "United Airlines"
• Transactional user needs: I want to make a transaction. (~35% / 20%)
– Buy something: "MacBook Air"
– Download something: "Acrobat Reader"
– Chat with someone: "live soccer chat"
• Gray areas
– Find a good hub
– Exploratory search: "see what's there"
• Difficult problem: How can the search engine tell what the user need or intent for a particular query is?


How far do people look for results?

• 40% of users look at the first page only

(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)


Users' evaluation of results

• Classic IR relevance (as measured by F, or precision and recall) can also be used for web IR.
– Precision: fraction of retrieved instances that are relevant
– Recall: fraction of relevant instances that are retrieved
• [Figure: precision and recall diagram]
– Relevant items are to the left of the straight line.
– Retrieved items are within the oval.
– The red regions represent errors. On the left these are the relevant items not retrieved (false negatives), while on the right they are the retrieved items that are not relevant (false positives).
– Precision and recall are the quotients of the left green region by, respectively, the oval (horizontal arrow) and the left region (diagonal arrow).
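The two definitions can be sketched as set operations. In the Python snippet below, the document IDs are made up, the function name is an assumption, and F is taken to be the balanced F1 measure (the harmonic mean of precision and recall).

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 for one query, given two collections of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)   # relevant items that were retrieved
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7", "d8"],
)
print(precision, recall, f1)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the six relevant documents were retrieved (recall of one third).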


Users’ empirical evaluation of results (cont.)

• On the web, precision is more important than recall.
– Precision is relative to the top k results
– Precision at page 1 or page 10? Precision for the first 20 results?
• Comprehensiveness: must be able to deal with obscure queries
– Recall matters when the number of matches is very small
• Quality of pages varies widely
– Relevance is not enough
• Other desirable qualities (non IR!!)
– Content: trustworthy, objective, diverse, non-duplicated, well maintained, coverage of topics for polysemic queries
– Web readability: display correctly & fast
– No annoyances: pop-ups, etc.
• User perceptions may be unscientific, but are significant over a large aggregate
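"Precision relative to the top k results" is usually called precision at k. A minimal sketch follows, assuming a ranked list of made-up document IDs and dividing by k even when fewer than k results are returned (one common convention, not something the slides specify).

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_results[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


ranked = ["d3", "d9", "d1", "d4", "d7"]
relevant = {"d1", "d3", "d7"}
for k in (1, 3, 5):
    print(k, precision_at_k(ranked, relevant, k))
```

Because users rarely look beyond the first results, a small k often matters more than recall over the full result set.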


Users’ empirical evaluation of engines

• Relevance and validity of results (discussed)

• UI – Simple, no clutter, error tolerant

• Pre/post-process tools provided
– Mitigate user errors (auto spell check, search assist, …)
– Explicit: search within results, more like this, refine ...
– Anticipative: related searches
• Deal with idiosyncrasies
– Web-specific vocabulary
  – Impact on stemming, spell-check, etc.
– Web addresses typed in the search box


Web search overall picture

[Figure: web search architecture. Labels: The Web, Web spider, Indexer, Indexes, Ad indexes, Search, User; arrows labeled "links" and "queries". Inset: a results page for the query "miele" (about 7,310,000 results) showing organic results and sponsored links.]

Sec. 19.4.1