Web Information Retrieval


TRANSCRIPT

1. Tutorial: Web Information Retrieval. Monika Henzinger, http://www.henzinger.com/~monika

2. What is this talk about?
   - Topic: algorithms for retrieving information on the Web
   - Non-topics: algorithmic issues in classic information retrieval (IR), e.g. stemming; string algorithms, e.g. approximate matching; other algorithmic issues related to the Web: networking and routing, caching, security, e-commerce

3. Preliminaries
   - Each Web page has a unique URL, e.g. http://theory.stanford.edu/~focs98/tutorials.html, consisting of the access protocol, the host name (with its domain name), and the path (page name)
   - (Hyper)link = pointer from one page to another; loads the second page if clicked on
   - In this talk: document = (Web) page

4. Example of a query: "princess diana"
   - Engine 1: relevant and high quality
   - Engine 2: relevant but low quality
   - Engine 3: not relevant (index pollution)

5. Outline
   - Classic IR vs. Web IR
   - Some IR tools specific to the Web, with examples and algorithmic issues for each type; details on ranking, duplicate elimination, search-by-example, and measuring search engine index quality
   - Conclusions and open problems

6. Classic IR
   - Input: document collection
   - Goal: retrieve documents or text whose information content is relevant to the user's information need
   - Two aspects: 1. processing the collection; 2. processing queries (searching)
   - Reference texts: SW97, BR99

7. Determining query results
   - Model = strategy for determining which documents to return
   - Logical model: string matches plus AND, OR, NOT
   - Vector model (Salton et al.): documents and the query are represented as vectors of terms; vector entry i = weight of term i = function of frequencies within the document and within the collection; similarity of document and query = cosine of the angle between their vectors; query result: documents ordered by similarity
   - Other models used in IR but not discussed here: probabilistic model, cognitive model, ...

8. IR on the Web
   - Input: the publicly accessible Web
   - Goal: retrieve high-quality pages that are relevant to the user's need, whether static (files: text, audio, ...) or dynamically generated on request (mostly database access)
   - Two aspects: 1. processing and representing the collection (gathering the static pages, learning about the dynamic pages); 2. processing queries (searching)

9. What's different about the Web? (1) Pages:
   - Bulk: >1B pages (12/99) [GL99]
   - Lack of stability: change estimates of 23%/day, 38%/week [CG99]
   - Heterogeneity: type of documents (text, pictures, audio, scripts, ...), quality (from dreck to ICDE papers), language (100+)
   - Duplication: syntactic: about 30% (near) duplicates; semantic: ??
   - Non-running text: many home pages, bookmarks, ...
   - High linkage: about 8 links/page on average

10. Typical home page: non-running text (screenshot)

11. Typical home page: non-running text (another screenshot)

12. The big challenge: meet the user needs given the heterogeneity of Web pages.

13. What's different about the Web? (2) Users:
   - Make poor queries: short (2.35 terms on average), imprecise terms, sub-optimal syntax (80% of queries without an operator), low effort
   - Specific behavior: 85% look over one result screen only, 78% of queries are not modified, users follow links (see the various user studies in CHI, Hypertext, SIGIR, etc.)
   - Wide variance in needs, knowledge, and bandwidth
14. The bigger challenge: meet the user needs given the heterogeneity of Web pages and the poorly made queries.

15. Why don't the users get what they want? Example:
   - User need: "I need to get rid of mice in the basement"
   - User request (verbalized): "What's the best way to trap mice alive?" (translation problems)
   - Query to the IR system: "mouse trap" (polysemy, synonymy)
   - Results: software, toy cars, inventive products, etc.

16. Google output for "mouse trap" (screenshot) ☺

17. Google output for "trap mice" (screenshot) ☺☺☺☺

18. The bright side: Web advantages vs. classic IR
   - User: many tools available, personalization, interactivity (refine the query if needed, make the users explain what they want)
   - Collection/tools: redundancy, hyperlinks, statistics (easy to gather, large sample sizes)

19. Quantifying the quality of results
   - How to evaluate different strategies?
   - How to compare different search engines?

20. Classic evaluation of IR systems. We start from a human-made relevance judgement for each (query, page) pair and compute:
   - Precision: % of returned pages that are relevant
   - Recall: % of relevant pages that are returned
   - Precision at (rank) 10: % of the top 10 pages that are relevant
   - Relative recall: % of the relevant pages found by some means that are returned
   (Venn diagram of relevant pages, returned pages, and all pages)

21. Evaluation in the Web context
   - Quality of pages varies widely
   - We need both relevance and high quality = value of a page. Precision at 10: % of the top 10 pages that are valuable.

22. Web IR tools
   - General-purpose search engines: direct (AltaVista, Excite, Google, Infoseek, Lycos, ...) and indirect (meta-search: MetaCrawler, DogPile, AskJeeves, InvisibleWeb, ...)
   - Hierarchical directories: Yahoo!, all portals
   - Specialized search engines (database mostly built by hand): home page finder (Ahoy), shopping robots (Jango, Junglee, ...), applet finders

23. Web IR tools (cont.)
   - Search-by-example: Alexa's "What's Related", Excite's "More Like This", Google's GoogleScout, etc.
   - Collaborative filtering: Firefly, GAB, ...
   - Meta-information: search engine comparisons, query log statistics

24. General-purpose search engines
   - Search engine components: spider = crawler (collects the documents), indexer (processes and represents the data), search interface (answers queries)

25. Algorithmic issues related to search engines
   - Collecting documents: priority, load balancing (internal, external), trap avoidance, ...
   - Processing and representing the data: query-independent ranking, graph representation, index building, duplicate elimination, clustering, categorization, ...
   - Processing queries: query-dependent ranking, duplicate elimination, query refinement, ...

26. Ranking
   - Goal: order the answers to a query in decreasing order of value
     - Query-independent: assign an intrinsic value to a document, regardless of the actual query
     - Query-dependent: value is determined only with respect to a particular query
     - Mixed: combination of both valuations
   - Examples: query-independent: length, vocabulary, publication data, number of citations (indegree), etc.; query-dependent: cosine measure (see the sketch below)
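To make the cosine measure concrete, here is a minimal sketch of vector-model scoring. It is not part of the original tutorial: it assumes raw term frequencies as weights, whereas real systems combine within-document and within-collection frequencies (tf-idf style) and many other signals.

    import math
    from collections import Counter

    def term_vector(text):
        # Represent a document (or query) as a bag of lowercased terms.
        return Counter(text.lower().split())

    def cosine(v1, v2):
        # Cosine of the angle between two sparse term vectors.
        dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
        norm1 = math.sqrt(sum(w * w for w in v1.values()))
        norm2 = math.sqrt(sum(w * w for w in v2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    def rank(query, documents):
        # Query-dependent ranking: order documents by decreasing similarity to the query.
        q = term_vector(query)
        scored = [(cosine(q, term_vector(d)), d) for d in documents]
        return [d for score, d in sorted(scored, reverse=True)]

    docs = ["the best way to trap mice alive",
            "mouse trap software for toy cars",
            "princess diana news"]
    print(rank("trap mice", docs))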
27. Some ranking criteria
   - Content-based techniques (variants of the term vector model or probabilistic model); mostly query-dependent
   - Ad-hoc factors (anti-porn heuristics, publication/location data, ...); mostly query-independent
   - Human annotations
   - Connectivity-based techniques: query-independent: PageRank [PBMW98, BP98], indegree [CK97], ...; query-dependent: HITS [K98], ...

28. Connectivity analysis
   - Idea: mine the hyperlink information of the Web
   - Assumptions: links often connect related pages; a link between pages is a recommendation
   - Classic IR work (citations = links), a.k.a. bibliometrics [K63, G72, S73, ...]; sociometrics [K53, MMSM86, ...]
   - Many Web-related papers build on this idea [PPR96, AMM97, S97, CK97, K98, BP98, ...]

29. Graph representation for the Web
   - A node for each page u
   - A directed edge (u,v) if page u contains a hyperlink to page v

30. Query-independent ranking: motivation for PageRank
   - Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A); so the quality of a page is related to its in-degree
   - Recursion: the quality of a page is related to its in-degree and to the quality of the pages linking to it: PageRank [BP98]

31. Definition of PageRank [BP98]
   - Consider the following infinite random walk (surf): initially the surfer is at a random page; at each step the surfer proceeds to a randomly chosen Web page with probability d, and to a randomly chosen successor of the current page with probability 1-d
   - The PageRank of a page p is the fraction of steps the surfer spends at p in the limit

32. PageRank (cont.) Said differently:
   - The transition probability matrix is d*U + (1-d)*A, where U is the uniform distribution and A is the (normalized) adjacency matrix
   - PageRank is the stationary probability of this Markov chain, i.e.
     PageRank(u) = d/n + (1-d) * \sum_{(v,u) \in E} PageRank(v) / outdegree(v),
     where n is the total number of nodes in the graph
   - Used as one of the ranking criteria in Google

33. Output from Google for "princess diana" (screenshot) ☺☺☺☺

34. Query-dependent ranking: the neighborhood graph
   - Subgraph associated with each query: the start set consists of the query results (Result1 ... Resultn), the back set of pages pointing to them (b1 ... bm), and the forward set of pages they point to (f1 ... fs)
   - An edge for each hyperlink, but no edges within the same host

35. HITS [K98]
   - Goal: given a query, find good sources of content (authorities) and good sources of links (hubs)

36. Intuition
   - Authority comes from in-edges; being a good hub comes from out-edges
   - Better authority comes from in-edges from good hubs; being a better hub comes from out-edges to good authorities

37. HITS details
   Repeat until HUB and AUTH converge:
     Normalize HUB and AUTH
     HUB[v]  := sum of AUTH[u_i] over all u_i with Edge(v, u_i)
     AUTH[v] := sum of HUB[w_i] over all w_i with Edge(w_i, v)
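A compact sketch of the two connectivity-based schemes just described: the PageRank recurrence of slide 32 computed by power iteration, and the HITS update of slide 37. This is illustrative only and not from the tutorial; the toy graph, damping value, and iteration counts are made up, and real systems run these computations on hundreds of millions of nodes with sparse-matrix machinery.

    def pagerank(graph, d=1/7, iterations=50):
        # graph: dict mapping node -> list of successors (out-links).
        # Implements PR(u) = d/n + (1-d) * sum over in-links (v,u) of PR(v)/outdegree(v).
        nodes = list(graph)
        n = len(nodes)
        pr = {u: 1.0 / n for u in nodes}
        for _ in range(iterations):
            nxt = {u: d / n for u in nodes}
            for v in nodes:
                out = graph[v]
                if not out:                      # dangling node: spread its mass uniformly
                    for u in nodes:
                        nxt[u] += (1 - d) * pr[v] / n
                else:
                    for u in out:
                        nxt[u] += (1 - d) * pr[v] / len(out)
            pr = nxt
        return pr

    def hits(graph, iterations=50):
        # HITS on a (small) neighborhood graph: mutually reinforcing hubs and authorities.
        nodes = list(graph)
        hub = {u: 1.0 for u in nodes}
        auth = {u: 1.0 for u in nodes}
        for _ in range(iterations):
            auth = {u: sum(hub[w] for w in nodes if u in graph[w]) for u in nodes}
            hub = {u: sum(auth[v] for v in graph[u]) for u in nodes}
            a_norm = sum(auth.values()) or 1.0
            h_norm = sum(hub.values()) or 1.0
            auth = {u: x / a_norm for u, x in auth.items()}
            hub = {u: x / h_norm for u, x in hub.items()}
        return hub, auth

    g = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(g))
    print(hits(g))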
38. Output from HITS for "jobs":
   1. www.ajb.dni.uk - British career website ☺
   2. www.britnet.co.uk/jobs.htm
   3. www.monster.com - US career website ☺
   4. www.careermosaic.com - US career website ☺
   5. plasma-gate.weizmann.ac.il/Job
   6. www.jobtrak.com - US career website ☺
   7. www.occ.com - US career website ☺
   8. www.jobserve.com - US career website ☺
   9. www.allny.com/jobs.html - jobs in NYC
   10. www.commarts.com/bin/ - US career website ☺

39. Output from HITS for "+jaguar +car":
   1. www.toyota.com
   2. www.chryslercars.com
   3. www.vw.com
   4. www.jaguarvehicles.com ☺
   5. www.dodge.com
   6. www.usa.mercedes-benz.com
   7. www.buick.com
   8. www.acura.com
   9. www.bmw.com
   10. www.honda.com

40. Problems and solutions
   - Some edges are wrong, i.e. not a recommendation: multiple edges from the same author, automatically generated links, spam, etc. Solution: weight edges to limit influence (in the diagram, three links from Author A to Author B each get weight 1/3).
   - Topic drift: the query "+jaguar +car" returns pages about cars in general. Solution: analyze content and assign topic scores to nodes.

41. Modified HITS algorithms [CDRRGK98, BH98, CDGKRRT98]
   Repeat until HUB and AUTH converge:
     Normalize HUB and AUTH
     HUB[v]  := sum of AUTH[u_i] * TopicScore[u_i] * weight[v,u_i] over all u_i with Edge(v, u_i)
     AUTH[v] := sum of HUB[w_i] * TopicScore[w_i] * weight[w_i,v] over all w_i with Edge(w_i, v)

42. Output from modified HITS for "+jaguar +car":
   1. www.jaguarcars.com/ - official website of Jaguar cars ☺
   2. www.collection.co.uk/ - official Jaguar accessories ☺
   3. home.sn.no/.../jaguar.html - the Jaguar Enthusiast Place ☺
   4. www.terrysjag.com/ - Jaguar parts ☺
   5. www.jaguarvehicles.com/ - official website of Jaguar cars ☺
   6. www.jagweb.com/ - for companies specializing in Jags ☺
   7. jagweb.com/jdht/jdht.html - articles about Jaguars and Daimlers
   8. www.jags.org/ - oldest Jaguar club ☺
   9. connection.se/jagsport/ - sports car version of the Jaguar MK II
   10. users.aol.com/.../jane.htm - Jaguar Association of New England Ltd.

43. User study [BH98] (bar chart): valuable pages within the top 10 answers, averaged over 28 topics, comparing Original, Edge Weighting, and Edge Weighting + Content Analysis, for authorities and hubs.

44. PageRank vs. HITS
   - PageRank: computation is expensive but done once for all documents and queries (offline); query-independent, so it requires combination with query-dependent criteria; hard to spam.
   - HITS: computation is expensive and required for each query; query-dependent; relatively easy to spam; quality depends on the quality of the start set; gives hubs as well as authorities.

45. Open problems
   - Compare the performance of query-dependent and query-independent connectivity analysis
   - Exploit the order of links on the page (see e.g. [CDGKRRT98], [DH99])
   - Both Google and HITS compute a principal eigenvector. What about non-principal eigenvectors? ([K98])
   - Derive other graphs from the hyperlink structure

46. Algorithmic issues related to search engines (overview slide repeated)

47. More on graph representation
   - Graphs derived from the hyperlink structure of the Web: node = page; edge (u,v) iff pages u and v are related in a specific way (directed or not)
   - Examples of edges: iff u has a hyperlink to v; iff there exists a page w pointing to both u and v; iff u is often retrieved within x seconds after v
48. Graph representation usage
   - Ranking algorithms: PageRank, HITS
   - Visualization/navigation: Mapuccino [MJSUZB97], WebCutter [MS97]
   - Search-by-example [DH99]
   - Structured Web query tools: WebSQL [AMM97], WebQuery [CK97]
   - Categorization of Web pages [CDI98]

49. Example: SRC Connectivity Server [BBHKV98] (directed edges = hyperlinks)
   - Goal: support two basic operations for all URLs collected by AltaVista: InEdges(URL u, int k) returns k URLs pointing to u; OutEdges(URL u, int k) returns k URLs that u points to
   - Difficulties: memory usage (~180M nodes, 1B edges), preprocessing time (days), query time (~0.0001s per result URL)

50. URL database
   - A sorted list of URLs is 8.7 GB (about 48 bytes/URL); delta encoding reduces it to 3.8 GB (about 21 bytes/URL)
   - Delta encoding stores, for each URL, the size of the prefix shared with the previous URL, the remaining suffix, and a node ID:
     www.foobar.com/                ->   0  www.foobar.com/    1
     www.foobar.com/gandalf.htm     ->  15  gandalf.htm       26
     www.foograb.com/               ->   7  grab.com/         41

51. Graph data structure: a node table with, per node, a pointer to the URL (in the URL database), a pointer into the inlist table, and a pointer into the outlist table.

52. Web graph factoids
   - >1B nodes (12/99); the mean indegree is ~8
   - Zipfian degree distributions [KRRT99]: F_in(i) = fraction of pages with indegree i behaves as F_in(i) ~ 1/i^2.1; F_out(i) = fraction of pages with outdegree i behaves as F_out(i) ~ 1/i^2.38

53. Open problems
   - Graph compression: how much compression is possible without a significant run-time penalty? Efficient algorithms to find frequently repeated small structures (e.g. wheels, K_{2,2})
   - External-memory graph algorithms: how to assign the graph representation to pages so as to reduce paging? (see [NGV96, AAMVV98])
   - Stringology: less space for the URL database? Faster algorithms for URL-to-node translation?
   - Dynamic data structures: how to make updates efficient at the same space cost?

54. Algorithmic issues related to search engines (overview slide repeated)

55. Index building
   - Inverted index data structure: consider all documents concatenated into one huge document; for each word keep an ordered array of all its positions in that document, potentially compressed
   - Allows efficient implementation of AND, OR, and AND NOT operations (see the sketch after this slide block)

56. Algorithmic issues related to search engines (overview slide repeated)

57. Reasons for duplicates
   - Proliferation of almost-equal documents on the Web: legitimate (mirrors, local copies, updates, etc.), malicious (spammers, spider traps, dynamic URLs), mistaken (spider errors)
   - Approximately 30% of the pages on the Web are (near) duplicates [BGMZ97, SG98]
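A toy version of the inverted index of slide 55, not from the talk: documents are conceptually concatenated, each word maps to a sorted list of positions, and an AND query is answered by intersecting posting information. Real indexes add compression, phrase handling, and per-document scoring.

    from collections import defaultdict

    def build_index(documents):
        # Concatenate all documents into one token stream and record,
        # for every word, the ordered list of positions where it occurs.
        index = defaultdict(list)
        doc_of_pos = {}            # position -> document id, to map hits back to documents
        pos = 0
        for doc_id, text in enumerate(documents):
            for word in text.lower().split():
                index[word].append(pos)
                doc_of_pos[pos] = doc_id
                pos += 1
        return index, doc_of_pos

    def and_query(index, doc_of_pos, *words):
        # AND semantics: return ids of documents containing every query word.
        sets = [{doc_of_pos[p] for p in index.get(w, [])} for w in words]
        return set.intersection(*sets) if sets else set()

    docs = ["jaguar cars are fast", "the jaguar is a cat", "fast cars and slow cats"]
    index, doc_of_pos = build_index(docs)
    print(and_query(index, doc_of_pos, "jaguar", "cars"))   # -> {0}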
58. Uses of duplicate information
   - Smarter crawlers
   - Smarter web proxies: better caching, handling broken links
   - Smarter search engines: no duplicate answers, smarter connectivity analysis, less RAM and disk

59. Two types of duplicate filtering
   - Fine-grain: finding near-duplicate documents
   - Coarse-grain: finding near-duplicate hosts (mirrors)

60. Fine-grain filtering: basic mechanism
   - Must filter both duplicate and near-duplicate documents
   - Computing pair-wise edit distance would take forever
   - Preferably store only a short sketch for each document

61. The basics of a solution [B97], [BGMZ97]
   1. Reduce the problem to a set intersection problem
   2. Estimate intersections by sampling minima

62. Shingling
   - Shingle = fixed-size sequence of w contiguous words; e.g. "a rose is a rose is a rose" yields the 4-word shingles "a rose is a", "rose is a rose", "is a rose is", ...
   - A document D is shingled into a set of shingles, which are then fingerprinted into a set of 64-bit fingerprints

63. Defining resemblance: for documents D1 and D2 with shingle sets S1 and S2,
     resemblance = |S1 ∩ S2| / |S1 ∪ S2|

64. Sampling minima
   - Apply a random permutation π to the set [0..2^64]
   - Crucial fact: let α = min(π(S1)) and β = min(π(S2)); then Pr(α = β) = |S1 ∩ S2| / |S1 ∪ S2|

65. Implementation
   - Choose a set of t random permutations of U
   - For each document keep a sketch S(D) consisting of the t minima (samples)
   - Estimate the resemblance of A and B by counting common samples (see the sketch after this slide block)
   - The permutations should be from a min-wise independent family of permutations; see [BCFM97] for the theory of min-wise independent permutations

66. If we need only high resemblance
   - Divide the sketch into k groups of s samples (t = k * s)
   - Fingerprint each group to form a feature
   - Two documents are fungible if they have at least r common features
   - Want: fungibility is equivalent to resemblance above a fixed threshold

67. Real implementation
   - Resemblance threshold ρ = 90%. In a 1000-word page with shingle length 8 this corresponds to deleting a paragraph of about 50-60 words or changing 5-6 random words.
   - Sketch size t = 84, divided into k = 6 groups of s = 14 samples
   - With 8-byte fingerprints we store only 6 x 8 = 48 bytes/document
   - Threshold r = 2

68. Probability that two documents with resemblance ρ are deemed fungible
   - Using the full sketch: P = \sum_{i=r \cdot s}^{k \cdot s} \binom{k s}{i} \rho^i (1-\rho)^{k s - i}
   - Using features: P = \sum_{i=r}^{k} \binom{k}{i} \rho^{s i} (1-\rho^s)^{k-i}

69. Features vs. full sketch (plot): probability that two pages are deemed fungible as a function of resemblance, using the full sketch and using features.

70. Fine-grain duplicate elimination: open problems and related work
   - Best way of grouping samples for a given threshold and/or for multiple thresholds?
   - Efficient ways to find, in a database, pairs of records that share many attributes. Best approach?
   - Min-wise independent permutations: lots of open questions
   - Other applications possible (images, sounds, ...); they need a translation into a set intersection problem
   - Related work: M94, BDG95, SG95, H96, FSGMU98

71. Two types of duplicate filtering (recap): fine-grain: finding near-duplicate documents; coarse-grain: finding near-duplicate hosts (mirrors).
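A small sketch of the fine-grain scheme from the preceding slides (shingling, min-wise sampling, and grouping samples into features); it is not the original implementation. A salted SHA-1 hash stands in for the 64-bit fingerprints and min-wise independent permutations, and the parameters are the ones quoted on slide 67.

    import hashlib

    def shingles(text, w=8):
        # Shingle = every sequence of w contiguous words, kept as a set.
        words = text.lower().split()
        return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

    def fingerprint(value, salt):
        # Stand-in for a 64-bit fingerprint under the salt-th "random permutation".
        digest = hashlib.sha1(f"{salt}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def sketch(text, t=84):
        # Keep, for each of the t permutations, the minimum fingerprinted shingle.
        sh = shingles(text)
        return [min(fingerprint(s, i) for s in sh) for i in range(t)]

    def resemblance_estimate(sk1, sk2):
        # Fraction of common samples estimates |S1 ∩ S2| / |S1 ∪ S2|.
        return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

    def fungible(sk1, sk2, k=6, s=14, r=2):
        # Group the sketch into k features of s samples; near-duplicates share >= r features.
        f1 = [tuple(sk1[i * s:(i + 1) * s]) for i in range(k)]
        f2 = [tuple(sk2[i * s:(i + 1) * s]) for i in range(k)]
        return sum(a == b for a, b in zip(f1, f2)) >= r

    a = " ".join(f"word{i}" for i in range(300))
    b = " ".join(f"word{i}" for i in range(300) if i != 150)   # delete one word
    print(resemblance_estimate(sketch(a), sketch(b)), fungible(sketch(a), sketch(b)))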
72. Input: set of URLs
   - Input: a subset of the URLs on various hosts, collected e.g. by a search engine crawl or a web proxy; no content of the pages pointed to by the URLs, except that each page is labeled with its out-links
   - Goal: find pairs of hosts that mirror content

73. Example: www.synthesis.org and synthesis.stanford.edu, with URL samples such as:
   www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html
   www.synthesis.org/Docs/ProjAbs/synsys/visual-semi-quant.html
   www.synthesis.org/Docs/annual.report96.final.html
   www.synthesis.org/Docs/cicee-berlin-paper.html
   www.synthesis.org/Docs/myr5
   www.synthesis.org/Docs/myr5/cicee/bridge-gap.html
   www.synthesis.org/Docs/myr5/cs/cs-meta.html
   www.synthesis.org/Docs/myr5/mech/mech-intro-mechatron.html
   www.synthesis.org/Docs/myr5/mech/mech-take-home.html
   www.synthesis.org/Docs/myr5/synsys/experiential-learning.html
   www.synthesis.org/Docs/myr5/synsys/mm-mech-dissec.html
   www.synthesis.org/Docs/yr5ar
   www.synthesis.org/Docs/yr5ar/assess
   www.synthesis.org/Docs/yr5ar/cicee
   www.synthesis.org/Docs/yr5ar/cicee/bridge-gap.html
   www.synthesis.org/Docs/yr5ar/cicee/comp-integ-analysis.html
   synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-classroom.html
   synthesis.stanford.edu/Docs/ProjAbs/mech/mech-enhanced-circ-anal.html
   synthesis.stanford.edu/Docs/ProjAbs/mech/mech-intro-mechatron.html
   synthesis.stanford.edu/Docs/ProjAbs/mech/mech-mm-case-studies.html
   synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html
   synthesis.stanford.edu/Docs/annual.report96.final.html
   synthesis.stanford.edu/Docs/annual.report96.final_fn.html
   synthesis.stanford.edu/Docs/myr5/assessment
   synthesis.stanford.edu/Docs/myr5/assessment/assessment-main.html
   synthesis.stanford.edu/Docs/myr5/assessment/mm-forum-kiosk-A6-E25.html
   synthesis.stanford.edu/Docs/myr5/assessment/neato-ucb.html
   synthesis.stanford.edu/Docs/myr5/assessment/not-available.html
   synthesis.stanford.edu/Docs/myr5/cicee
   synthesis.stanford.edu/Docs/myr5/cicee/bridge-gap.html
   synthesis.stanford.edu/Docs/myr5/cicee/cicee-main.html
   synthesis.stanford.edu/Docs/myr5/cicee/comp-integ-analysis.html

74. Coarse-grain filtering: basic mechanism
   - Must filter both duplicate and near-duplicate mirrors
   - Pair-wise testing would take forever
   - Both high precision (not outputting wrong mirrors) and high recall (finding almost all mirrors) are important

75. A definition of mirroring: Host1 and Host2 are mirrors iff for all paths p such that http://Host1/p is a web page, http://Host2/p exists with duplicate (or near-duplicate) content, and vice versa.

76. The basics of a solution [BBDH99]
   1. Pre-filter to create a small set of pairs of potential mirrors (pre-filtering step)
   2. Test each pair of potential mirrors (testing step)
   3. Use different pre-filtering algorithms to improve recall

77. Testing step
   - Test the root pages plus x URLs sampled from each host
   - If one test returns "not near-duplicate", then the hosts are not mirrors
   - If the root pages and more than a fixed fraction of the x sampled URLs from each host are near-identical, then the hosts are mirrors; otherwise they are not
78. Pre-filtering step
   - Goal: quickly output a list of pairs of potential mirrors containing many true mirror pairs (high recall) and not many non-mirror pairs (high precision)
   - Note: two-sided error is allowed. Type 1: true mirror pairs might be missing from the output. Type 2: non-mirror pairs might be output.
   - Testing of host pairs will eliminate type-2 errors, but not type-1 errors

79. Different pre-filtering techniques
   - IP-based
   - URL-string based
   - URL-string and hyperlink based
   - Hostname and hyperlink based

80. Problem with IP addresses: distinct hosts can share an IP address; e.g. eliza-iii.ibex.co.nz and pixel.ibex.co.nz both resolve to 203.29.170.23.

81. Chart: probability of mirroring vs. number of hosts sharing the same IP address (2 through >9).

82. IP-based pre-filtering algorithms
   - IP4: cluster hosts based on their full IP address; enumerate pairs from clusters in increasing cluster size (max 200 pairs)
   - IP3: cluster hosts based on the first 3 octets of their IP address; enumerate pairs from clusters in increasing cluster size (max 5 pairs)

83. URL-string based pre-filtering algorithms
   - Information extracted from URL strings: similar hostnames might belong to the same organization; similar paths might indicate replicated directories. So extract features for each host from its URL strings and test similarity.
   - Similarity testing approach: build a feature vector for each host, analogous to a term vector for a document (host corresponds to document, feature corresponds to term); similarity of hosts = cosine of the angle between their feature vectors

84. URL-string based algorithms (cont.)
   - paths: features are full paths, e.g. /staff/homepages/~dilbert/foo
   - prefixes: features are path prefixes, e.g. /staff, /staff/homepages, /staff/homepages/~dilbert, /staff/homepages/~dilbert/foo (see the sketch after this slide block)
   - Other variants: hosts and shingles

85. Paths + connectivity (conn)
   - Take the output from paths and filter as follows: consider the 10 common paths in the sample with the highest outdegree; two paths are equivalent if 90% of their combined out-edges are common to both; keep a host pair if 75% of the paths are equivalent

86. Hostname connectivity
   - Idea: mirrors point to a similar set of other hosts
   - Feature vector approach to test similarity: features are the hosts that are pointed to; two different ways of feature weighting: hconn1 and hconn2

87. Experiments
   - Input: 140 million URLs on 233,035 hosts, plus out-edges (the original 179 million URLs were reduced by considering only hosts with at least 100 URLs in the set)
   - For each of the above pre-filtering algorithms: compute a list of 25,000 (respectively 100,000) ranked pairs of potential mirrors; test each pair of potential mirrors (testing step) and output the list of mirrors; determine precision and relative recall

88. Chart: precision up to rank 25,000.

89. Chart: relative recall up to rank 25,000.

90. Relative recall at 25,000 for the combined output of pairs of algorithms (diagonal entries are single algorithms):

                hosts  IP3   IP4   conn  hconn1  hconn2  paths  prefix  shingles
   hosts        17%
   IP3          39%    30%
   IP4          61%    58%   54%
   conn         58%    66%   80%   47%
   hconn1       40%    51%   69%   59%   26%
   hconn2       41%    52%   70%   60%   29%    27%
   paths        48%    59%   78%   55%   51%    52%    36%
   prefix       54%    61%   75%   65%   58%    58%    57%    44%
   shingles     53%    61%   75%   64%   57%    58%    57%    48%    44%
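To make the URL-string pre-filtering of slides 83-84 concrete, here is a small sketch (not from the paper): each host becomes a vector of path-prefix features, and host similarity is the cosine of the angle between the vectors. Feature weighting and the ranking of candidate pairs are simplified; the threshold value is an arbitrary illustration.

    import math
    from collections import Counter, defaultdict
    from itertools import combinations

    def prefix_features(path):
        # e.g. /staff/homepages/~dilbert/foo ->
        #      /staff, /staff/homepages, /staff/homepages/~dilbert, /staff/homepages/~dilbert/foo
        parts = [p for p in path.split("/") if p]
        return {"/" + "/".join(parts[:i + 1]) for i in range(len(parts))}

    def host_vectors(urls):
        # urls: iterable of (host, path); build one feature vector per host.
        vectors = defaultdict(Counter)
        for host, path in urls:
            vectors[host].update(prefix_features(path))
        return vectors

    def cosine(v1, v2):
        dot = sum(v1[f] * v2[f] for f in v1 if f in v2)
        n1 = math.sqrt(sum(x * x for x in v1.values()))
        n2 = math.sqrt(sum(x * x for x in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def candidate_mirrors(urls, threshold=0.7):
        # Pre-filtering only: emits potential mirror pairs for the testing step.
        vectors = host_vectors(urls)
        return [(h1, h2) for h1, h2 in combinations(sorted(vectors), 2)
                if cosine(vectors[h1], vectors[h2]) >= threshold]

    sample = [("www.synthesis.org", "/Docs/myr5/cicee/bridge-gap.html"),
              ("synthesis.stanford.edu", "/Docs/myr5/cicee/bridge-gap.html"),
              ("www.example.com", "/products/index.html")]
    print(candidate_mirrors(sample))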
91. Combined approach (combined)
   - Combines the top 100,000 results from hosts, IP4, paths, prefix, and hconn1
   - Sort host pairs by the number of algorithms that return the pair; use the best rank in any algorithm to break ties between host pairs
   - At rank 100,000: relative recall of 86%, precision of 57%

92. Chart: precision vs. relative recall.

93. Web host graph
   - A node for each host h
   - An undirected edge (h, h') if h and h' are output as mirrors
   - Each (connected) component gives a set of mirrors

94. Example of a component: Protein Data Bank mirrors.

95. Component size distribution (chart): 43,491 mirrored hosts out of the 233,035 considered; number of components by component size: 15,650 (size 2), 2,204 (size 3), 553, 185, 95, 49, 26, 18, 13, 10, 6, 6, 4, 3 (size 15).

96. Coarse-grain duplicate filtering: summary and open problems
   - Mirroring is common (43,491 mirrored hosts out of 233,035 considered hosts): load balancing, franchises/branding, virtual hosting, spam
   - Mirror detection based on non-content attributes is feasible
   - [CSG00] use a page-content similarity based approach. Open problem: compare and combine content and non-content techniques.
   - Open problem: assume you can choose which URLs to visit at a host; determine the best technique.

97. Algorithmic issues related to search engines (overview slide repeated)

98. Adding pages to the index: the crawling process
   - Maintain a queue of links to explore, seeded also with expired pages from the index
   - Loop: get the link at the top of the queue, fetch the page, index the page and parse its links, and add the new URLs to the queue

99. Queuing discipline
   - Standard graph exploration: random, BFS, DFS (+ depth limits)
   - Goal: get the best pages for a given index size. Priority based on query-independent ranking: highest indegree [M95], highest potential PageRank [CGP98]
   - Goal: keep the index fresh. Priority based on rate of change [CLW97]

100. Load balancing
   - Internal: the crawler cannot handle too much retrieved data simultaneously, but response time is unpredictable, the size of answers is unpredictable, and there are additional system constraints (number of threads, number of open connections, etc.)
   - External: should not overload any server or connection; a well-connected crawler can saturate the entire outside bandwidth of some small countries; any queuing discipline must be acceptable to the community
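The crawling loop and queuing discipline of slides 98-99 can be sketched as a priority queue keyed by a query-independent score (indegree here, standing in for potential PageRank). This is a schematic, single-threaded illustration; fetch_page and extract_links are hypothetical stand-ins for real networking and parsing code, and the politeness and load-balancing concerns of slide 100 are omitted.

    import heapq

    def crawl(seeds, fetch_page, extract_links, score, budget=1000):
        # Frontier ordered by priority: heapq is a min-heap, so push -score.
        frontier = [(-score(url), url) for url in seeds]
        heapq.heapify(frontier)
        seen = set(seeds)
        indexed = []
        while frontier and len(indexed) < budget:
            _, url = heapq.heappop(frontier)      # best-scored URL first
            page = fetch_page(url)                # hypothetical: download the page
            if page is None:
                continue
            indexed.append(url)                   # "index page"
            for link in extract_links(page):      # "parse links"
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
        return indexed

    # Toy usage with an in-memory "web" and indegree-based priority.
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "b"]}
    indegree = {"a": 1, "b": 2, "c": 2}
    print(crawl(["a"], web.get, lambda links: links, lambda u: indegree.get(u, 0)))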
101. Web IR tools (roadmap): next, hierarchical directories.

102. Hierarchical directories. Building hierarchical directories:
   - Manual: Yahoo!, LookSmart, Open Directory
   - Automatic: populating the hierarchy [CDRRGK98]: for each node in the hierarchy, formulate a fine-tuned query and run a modified HITS algorithm. Categorization: for each document, find its best placement in the hierarchy; techniques are connectivity and/or text based [CDI98, ...]

103. Web IR tools (roadmap): next, specialized search engines (shopping robots, home page finder [SLE97], applet finders).

104. Dealing with heterogeneous sources
   - Modern-life problem: given information sources with various capabilities, query all of them and combine the output
   - Examples: inter-business e-commerce (e.g. www.industry.net), meta search engines, shopping robots
   - Issues: determining relevant sources (the identification problem); merging the results (the fusion problem)

105. Example: a shopping robot
   - Input: a product description in some form. Find: merchants for that product on the Web.
   - Jango [DEW97]: preprocessing: store vendor URLs in a database and learn, for each vendor, the URL of the search form, how to fill in the search form, and how the answer is returned. Request processing: fill out the form at every vendor and test whether the result is a success. The range of products is predetermined.

106. Jango input example (screenshot).

107. Jango output (screenshot).

108. Direct query to K&L Wine (screenshot).

109. What price decadence? (screenshot)

110. Open problems
   - Going beyond the lowest common capability
   - Learning problem: automatic understanding of new interfaces
   - Efficient ways to determine which database is relevant to a particular query
   - Partial-knowledge indexing: the indexer has limited access to the full database

111. Web IR tools (roadmap): next, search-by-example.

112. Search-by-example
   - Given: a set S of URLs. Find: URLs of similar pages.
   - Various approaches: connectivity-based; usage-based (related pages are pages visited frequently after S); query refinement
   - Related work: G72, G79, K98, PP97, S97, CDI98

113. Output from Google for related:www.ebay.com (screenshot).

114. Output from Alexa for www.ebay.com (screenshot).

115. Connectivity-based solutions [DH99]
   - Algorithm Companion
   - Algorithm Co-citation

116. Algorithm Companion
   - Build a modified neighborhood graph N
   - Run a modified HITS algorithm on N
   - Major question: how to form the neighborhood graph so that the top returned pages are useful related pages

117. Building the neighborhood graph N
   - Node set: from URL u go back, forward, back-forward, and forward-back
   - Edge set: a directed edge if there is a hyperlink between two nodes
   - Apply refinements to N

118. Refinement 1: limit out-degree
   - Motivation: some nodes have high out-degree, so the graph would become too large
   - Limit the out-degree when going forward: going forward from u, choose the first 50 out-links on the page; going forward from other nodes, choose the 8 out-links surrounding the in-link traversed to reach the node
119. Co-citation algorithm
   - Determine 2000 arbitrary back-nodes of u
   - Add to the set S of siblings of u: for each back-node, the 8 forward-nodes surrounding the link to u
   - If there is enough co-citation with u, return the nodes in S in decreasing order of co-citation; else restart the algorithm with one path element removed (http://host/X/Y/ -> http://host/X/)

120. Alexa's "What's Related"
   - Uses document content, usage data, and connectivity
   - Removes path elements if no answer for u is found

121. User study (chart): valuable pages within the top 10 answers, averaged over 59 (ALL) or 37 (INTERSECTION) queries, comparing Alexa, Co-citation, and Companion.

122. Web IR tools (roadmap): next, collaborative filtering.

123. Collaborative filtering (diagram): explicit user input and collected preferences feed an analysis phase that builds a model; a prediction phase then uses the model to produce suggestions.

124. Lots of projects
   - Collaborative filtering was seldom used in classic IR; big revival on the Web. Projects: PHOAKS (AT&T Labs; Web page recommendation based on Usenet postings), GAB (Bellcore; Web browsing), GroupLens (U. Minnesota; Usenet newsgroups), EachToEach (Compaq SRC; rating movies). See http://sims.berkeley.edu/resources/collab/

125. Why do we care?
   - The ranking schemes that we discussed are also a form of collaborative ranking: connectivity = people vote with their links; usage = people vote with their clicks
   - These schemes are used only for global model building. Can they be combined with per-user data? Ideas: consider the graph induced by the user's bookmarks; profile the user (see www.globalbrain.net); deal with privacy concerns!

126. Web IR tools (roadmap): next, meta-information (comparison of search engines, query log statistics).

127. Comparison of search engines: candidate measures are user satisfaction (the ideal measure), number of user requests, quality of the search engine index, and size of the search engine index; they differ in difficulty of independent measurement and usefulness for comparison.

128. Comparing search engine sizes
   - Naive approaches: get a list of URLs from each search engine and compare (not practical or reliable); result set size comparison (reported sizes are approximate)
   - Better approach: statistical sampling

129. URL sampling
   - Ideal strategy: generate a random URL and check for containment in each index
   - Random URLs are hard to generate. Random walk methods: the graph is directed, the stationary distribution is non-uniform, and one must prove rapid mixing. Pages in caches, query logs [LG98a], etc.: correlated to the interests of a particular group of users.
   - A simple way: collect all pages on the Web and pick one at random

130. Sampling via queries [BB98]
   - Search engines have the best crawlers; why not exploit them?
   - Method: sample from each engine in turn; estimate the relative sizes of two search engines; compute absolute sizes from a reference point
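The [BB98] estimate just outlined reduces to simple arithmetic once the sampling and checking steps yield overlap fractions. A hedged sketch (not the authors' code), matching the worked example on the next slide where half of A's sample is found in B and a sixth of B's sample is found in A:

    def relative_size(frac_A_in_B, frac_B_in_A):
        # If |A ∩ B| ≈ frac_A_in_B * |A| and |A ∩ B| ≈ frac_B_in_A * |B|,
        # then |B| / |A| ≈ frac_A_in_B / frac_B_in_A.
        return frac_A_in_B / frac_B_in_A

    def overlap_fraction(sample_from_X, contained_in_Y):
        # Fraction of pages sampled from engine X that engine Y also indexes.
        hits = sum(1 for page in sample_from_X if contained_in_Y(page))
        return hits / len(sample_from_X)

    # With the figures used in the tutorial's example: 1/2 and 1/6 give |B| ≈ 3|A|.
    print(relative_size(1/2, 1/6))   # -> 3.0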
131. Estimating relative sizes
   - Select pages randomly from A (respectively B) and check whether each page is contained in B (respectively A)
   - Example: if |A ∩ B| ≈ (1/2)|A| and |A ∩ B| ≈ (1/6)|B|, then |B| ≈ 3|A|
   - Two steps: (i) selecting, (ii) checking

132. Selecting a random page
   - Generate a random query: build a lexicon of words that occur on the Web and combine random words from the lexicon to form queries
   - Get the first 100 query results from engine A
   - Select a random page out of the set
   - The distribution is biased; the conjecture is that
     (\sum_{D \in A \cap B} p(D)) / (\sum_{D \in A} p(D)) ~ |A ∩ B| / |A|,
     where p(D) is the probability that D is picked by this scheme

133. Checking whether an engine has a page
   - Create a unique query for the page: use 8 rare words, e.g. for the Digital Systems Research Center home page (example query shown on the slide)

134. Results of the BB98 study (chart, status as of July 98): Web size about 350 million pages, growth about 25M pages/month, the six largest engines together cover about 2/3 of the Web, small overlap (3M pages). Estimated sizes in millions of pages: static Web 350, union 240, AltaVista 125, HotBot 100, Northern Light 75, Excite 45, Infoseek 37, Lycos 21 (with earlier data points for Jun-97, Nov-97, and Mar-98).

135. Crawling strategies are different! Exclusive listings in millions of pages (Jul-98): AltaVista 50, HotBot 45, Northern Light 20, Excite 13, Infoseek 8, Lycos 5.

136. Comparison of search engines (slide 127 repeated; now turning to index quality).

137. Quality: a general definition [HHMN99]
   - Assign each page p a weight w(p) such that \sum_p w(p) = 1; this can be thought of as a probability distribution on pages
   - The quality of a search engine index S is w(S) = \sum_{p \in S} w(p)
   - Example: if w is the same for all pages, quality is proportional to total size (in pages)
   - The average page quality in index S is w(S)/|S|
   - We use: the weight w(p) of a page p is its PageRank

138. Estimating quality by sampling. Suppose we can choose random pages according to w (so that page p appears with probability w(p)):
   - Choose a sample of pages p1, p2, ..., pn
   - Check whether the pages are in the search engine index S
   - The estimate for the quality of index S is the percentage of sampled pages that are in S, i.e.
     \hat{w}(S) = (1/n) \sum_j I[p_j \in S], where I[p_j \in S] = 1 if p_j is in S and 0 otherwise
     (see the sketch after this slide block)

139. Missing pieces
   - Sample pages according to the PageRank distribution
   - Test whether page p is in search engine index S (same methodology as [BB98])

140. Sampling pages (almost) according to PageRank
   - Perform a random walk and select n random pages from it
   - Problems: starting-state bias (a finite walk only approximates PageRank); we cannot jump to a random page, so instead we jump to a random page on a random host seen so far
   - Result: pages are sampled according to a distribution that behaves similarly to PageRank but is not identical to it
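The quality estimator of slide 138 is a one-liner once sample pages are available. A minimal sketch, assuming a sample drawn (approximately) according to w, e.g. via the random walk of slide 140, and a containment test such as the rare-word query trick of slide 133; the URLs below are made up for illustration.

    def estimate_quality(sample_pages, in_index):
        # sample_pages: pages drawn with probability w(p) (e.g. via the random walk).
        # in_index(p): True if the search engine index S contains page p.
        # Returns w_hat(S) = (1/n) * sum_j I[p_j in S].
        return sum(1 for p in sample_pages if in_index(p)) / len(sample_pages)

    # Toy usage: a pretend index containing two of four sampled URLs.
    index = {"http://a.example/", "http://b.example/"}
    sample = ["http://a.example/", "http://b.example/",
              "http://c.example/", "http://d.example/"]
    print(estimate_quality(sample, index.__contains__))   # -> 0.5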
141. Experiments: performed two long random walks with d = 1/7, starting at www.yahoo.com

                                        Walk 1       Walk 2
   length                               18 hours     54 hours
   attempted downloads                  2,867,466    6,219,704
   HTML pages successfully downloaded   1,393,265    2,940,794
   unique HTML pages                    509,279      1,002,745
   sampled pages                        1,025        1,100

142. Random walk effectiveness
   - Pages (or hosts) that are highly reachable are visited often by the random walks
   - The initial bias toward www.yahoo.com is reduced in the longer walk
   - Results are consistent over the two walks
   - The average indegree of pages with indegree