Search Engines: Information Retrieval in Practice
Chapter 3
All slides ©Addison Wesley, 2008


Web Crawler
- Finds and downloads web pages automatically
  - provides the collection for searching
- Web is huge and constantly growing
- Web is not under the control of search engine providers
- Web pages are constantly changing
- Crawlers also used for other types of data

Retrieving Web Pages
- Every page has a unique uniform resource locator (URL)
- Web pages are stored on web servers that use HTTP to exchange information with client software
  - e.g.,

Retrieving Web Pages
- Web crawler client program connects to a domain name system (DNS) server
- DNS server translates the hostname into an internet protocol (IP) address
- Crawler then attempts to connect to server host using specific port
- After connection, crawler sends an HTTP request to the web server to request a page
  - usually a GET request

Crawling the Web
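The Crawling the Web slide has no transcript text. As a concrete illustration of the request sequence just described, here is a minimal Python sketch using only the standard library; example.com and the page path are placeholders, not taken from the slides.

    import socket
    from http.client import HTTPConnection

    host, path = "example.com", "/index.html"   # hypothetical page, for illustration only

    # DNS lookup: translate the hostname into an IP address.
    ip_address = socket.gethostbyname(host)
    print("resolved", host, "to", ip_address)

    # Connect to the web server on the standard HTTP port (80)
    # and send a GET request for the page.
    conn = HTTPConnection(host, 80, timeout=10)
    conn.request("GET", path, headers={"User-Agent": "ExampleCrawler/0.1"})
    response = conn.getresponse()
    print(response.status, response.reason)
    html = response.read()    # page content that the crawler will parse for links
    conn.close()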

Web Crawler
- Starts with a set of seeds, which are a set of URLs given to it as parameters
- Seeds are added to a URL request queue
- Crawler starts fetching pages from the request queue
- Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
- New URLs added to the crawler's request queue, or frontier
- Continue until no more new URLs or disk full

Web Crawling
- Web crawlers spend a lot of time waiting for responses to requests
- To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
- Crawlers could potentially flood sites with requests for pages
- To avoid this problem, web crawlers use politeness policies
  - e.g., delay between requests to same web server

Controlling Crawling
- Even crawling a site slowly will anger some web server administrators, who object to any copying of their data
- Robots.txt file can be used to control crawlers
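A short sketch of how a crawler can honor these rules with Python's built-in robots.txt parser; the site and the user-agent name are invented for illustration.

    from urllib.robotparser import RobotFileParser

    # robots.txt always lives at the root of the host.
    robots = RobotFileParser("http://example.com/robots.txt")
    robots.read()

    # Ask whether our crawler's user agent may fetch a particular URL.
    if robots.can_fetch("ExampleCrawler", "http://example.com/private/data.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")

    # Some sites also declare a preferred delay between requests (None if absent).
    print("crawl delay:", robots.crawl_delay("ExampleCrawler"))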

Simple Crawler Thread
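The code on this slide is not reproduced in the transcript. The sketch below is a rough single-threaded version of the same loop: a frontier seeded with URLs, a politeness delay per host, and a fetch_and_parse helper (assumed, not defined here) that downloads a page and returns its text and outgoing links.

    import time
    from collections import deque
    from urllib.parse import urljoin, urlparse

    def crawl(seeds, politeness_delay=1.0, max_pages=100):
        frontier = deque(seeds)     # URL request queue, initialized with the seeds
        seen = set(seeds)           # never add the same URL to the frontier twice
        last_visit = {}             # per-host time of last request, for politeness
        pages = {}

        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            host = urlparse(url).hostname

            # Politeness policy: delay between requests to the same web server.
            wait = politeness_delay - (time.time() - last_visit.get(host, 0.0))
            if wait > 0:
                time.sleep(wait)
            last_visit[host] = time.time()

            # fetch_and_parse is an assumed helper: HTTP GET plus link extraction.
            text, links = fetch_and_parse(url)
            pages[url] = text

            # Add newly discovered URLs to the frontier.
            for link in links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return pages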

Freshness
- Web pages are constantly being added, deleted, and modified
- Web crawler must continually revisit pages it has already crawled to see if they have changed in order to maintain the freshness of the document collection
  - stale copies no longer reflect the real contents of the web pages

Freshness
- HTTP protocol has a special request type called HEAD that makes it easy to check for page changes
  - returns information about page, not page itself
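A minimal sketch of a HEAD-based change check with the standard library; the host, path, and stored date are placeholders.

    from http.client import HTTPConnection

    host, path = "example.com", "/index.html"
    last_seen = "Thu, 01 Jan 2004 00:00:00 GMT"   # Last-Modified value saved at the previous crawl

    conn = HTTPConnection(host, timeout=10)
    conn.request("HEAD", path)                    # headers only; the page itself is not returned
    response = conn.getresponse()
    last_modified = response.getheader("Last-Modified")
    conn.close()

    # If the reported modification date differs from the stored one, re-crawl the page.
    if last_modified and last_modified != last_seen:
        print("page has changed since the last crawl")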

Freshness
- Not possible to constantly check all pages
  - must check important pages and pages that change frequently
- Freshness is the proportion of pages that are fresh
- Optimizing for this metric can lead to bad decisions, such as not crawling popular sites
- Age is a better metric

Freshness vs. Age

Age
- Expected age of a page t days after it was last crawled:
- Web page updates follow the Poisson distribution on average
  - time until the next update is governed by an exponential distribution
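The formula on the slide did not survive in this transcript. Under the assumption just stated, with λ as the mean change rate and t the number of days since the page was last crawled, the standard form of the expected age is:

    \mathrm{Age}(\lambda, t)
      = \int_0^t P(\text{page changed at time } x)\,(t - x)\,dx
      = \int_0^t \lambda e^{-\lambda x}(t - x)\,dx
      = t - \frac{1 - e^{-\lambda t}}{\lambda}

The last step is simply the integral worked out in closed form.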

Age
- The older a page gets, the more it costs not to crawl it
  - e.g., expected age with mean change frequency λ = 1/7 (one change per week)
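Plugging that rate into the formula above, a page left uncrawled for a week already has a sizeable expected age:

    \mathrm{Age}\!\left(\tfrac{1}{7},\, 7\right)
      = 7 - 7\left(1 - e^{-1}\right)
      = \frac{7}{e} \approx 2.6 \text{ days}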

Focused Crawling
- Attempts to download only those pages that are about a particular topic
  - used by vertical search applications
- Rely on the fact that pages about a topic tend to have links to other pages on the same topic
  - popular pages for a topic are typically used as seeds
- Crawler uses text classifier to decide whether a page is on topic

Deep Web
- Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web
  - much larger than conventional Web
- Three broad categories:
  - private sites: no incoming links, or may require log in with a valid account
  - form results: sites that can be reached only after entering some data into a form
  - scripted pages: pages that use JavaScript, Flash, or another client-side language to generate links

Sitemaps
- Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency
- Generated by web server administrators
- Tells crawler about pages it might not otherwise find
- Gives crawler a hint about when to check a page for changes

Sitemap Example
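The sitemap on the example slide is not included in the transcript. The sketch below uses a small feed in the public sitemaps.org XML format (URLs and dates invented) to show the information a crawler can read from it.

    import xml.etree.ElementTree as ET

    # A real crawler would fetch this file from the web server (e.g., /sitemap.xml)
    # rather than embed it as a string.
    sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/</loc>
        <lastmod>2008-01-15</lastmod>
        <changefreq>monthly</changefreq>
      </url>
      <url>
        <loc>http://example.com/news/</loc>
        <changefreq>hourly</changefreq>
      </url>
    </urlset>"""

    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for url in ET.fromstring(sitemap_xml).findall("sm:url", ns):
        loc = url.findtext("sm:loc", namespaces=ns)
        lastmod = url.findtext("sm:lastmod", default="unknown", namespaces=ns)
        changefreq = url.findtext("sm:changefreq", default="unknown", namespaces=ns)
        # lastmod and changefreq tell the crawler when a page is worth revisiting.
        print(loc, lastmod, changefreq)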

Distributed Crawling
- Three reasons to use multiple computers for crawling:
  - Helps to put the crawler closer to the sites it crawls
  - Reduces the number of sites the crawler has to remember
  - Reduces computing resources required
- Distributed crawler uses a hash function to assign URLs to crawling computers
  - hash function should be computed on the host part of each URL

Desktop Crawls
- Used for desktop search and enterprise search
- Differences from web crawling:
  - Much easier to find the data
  - Responding quickly to updates is more important
  - Must be conservative in terms of disk and CPU usage
  - Many different document formats
  - Data privacy very important

Document Feeds
- Many documents are published
  - created at a fixed time and rarely updated again
  - e.g., news articles, blog posts, press releases, email
- Published documents from a single source can be ordered in a sequence called a document feed
  - new documents found by examining the end of the feed

Document Feeds
- Two types:
  - A push feed alerts the subscriber to new documents
  - A pull feed requires the subscriber to check periodically for new documents
- Most common format for pull feeds is called RSS
  - Really Simple Syndication, RDF Site Summary, Rich Site Summary, or ...

RSS Example
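The RSS file on these example slides is likewise missing from the transcript. Below is a minimal, made-up RSS 2.0 feed and the fields a crawler cares about: the ttl discussed next, and the item links and dates.

    import xml.etree.ElementTree as ET

    rss_xml = """<?xml version="1.0"?>
    <rss version="2.0">
      <channel>
        <title>Example News</title>
        <link>http://example.com/news</link>
        <ttl>60</ttl>
        <item>
          <title>A news story</title>
          <link>http://example.com/news/story1.html</link>
          <pubDate>Tue, 19 Feb 2008 15:21:36 GMT</pubDate>
        </item>
      </channel>
    </rss>"""

    channel = ET.fromstring(rss_xml).find("channel")
    print("cache this feed for", channel.findtext("ttl"), "minutes")   # ttl = time to live
    for item in channel.findall("item"):
        # Each item describes a newly published document; the crawler fetches its link.
        print(item.findtext("pubDate"), item.findtext("link"))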

RSS
- ttl tag (time to live)
  - amount of time (in minutes) contents should be cached
- RSS feeds are accessed like web pages
  - using HTTP GET requests to web servers that host them
- Easy for crawlers to parse
- Easy to find new information

Conversion
- Text is stored in hundreds of incompatible file formats
  - e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF
- Other types of files also important
  - e.g., PowerPoint, Excel
- Typically use a conversion tool
  - converts the document content into a tagged text format such as HTML or XML
  - retains some of the important formatting information

Character Encoding
- A character encoding is a mapping between bits and glyphs
  - i.e., getting from bits in a file to characters on a screen
- Can be a major source of incompatibility
- ASCII is the basic character encoding scheme for English
  - encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes

Character Encoding
- Other languages can have many more glyphs
  - e.g., Chinese has more than 40,000 characters, with over 3,000 in common use
- Many languages have multiple encoding schemes
  - e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
  - must specify encoding
  - can't have multiple languages in one file
- Unicode developed to address encoding problems

Unicode
- Single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages
- Unicode is a mapping between numbers and glyphs
  - does not uniquely specify a bits-to-glyph mapping!
  - e.g., UTF-8, UTF-16, UTF-32

Unicode
- Proliferation of encodings comes from a need for compatibility and to save space
  - UTF-8 uses one byte for English (ASCII), as many as 4 bytes for some traditional Chinese characters
  - variable-length encoding, more difficult to do string operations
  - UTF-32 uses 4 bytes for every character
- Many applications use UTF-32 for internal text encoding (fast random lookup) and UTF-8 for disk storage (less space)

Unicode
- e.g., Greek letter pi (π) is Unicode symbol number 960
  - in binary, 00000011 11000000 (3C0 in hexadecimal)
  - final UTF-8 encoding is 11001111 10000000 (CF80 in hexadecimal)
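This example is easy to verify in Python, since the encodings are built in:

    pi = chr(960)                        # code point 960 (U+03C0) is the Greek letter pi
    print(hex(ord(pi)))                  # 0x3c0
    print(pi.encode("utf-8").hex())      # cf80      -> the two UTF-8 bytes 11001111 10000000
    print(pi.encode("utf-32-be").hex())  # 000003c0  -> UTF-32 always uses 4 bytes per character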

Storing the Documents
- Many reasons to store converted document text
  - saves crawling time when page is not updated
  - provides efficient access to text for snippet generation, information extraction, etc.
- Database systems can provide document storage for some applications
  - web search engines use customized document storage systems

Storing the Documents
- Requirements for document storage system:
  - Random access
    - request the content of a document based on its URL
    - hash function based on URL is typical
  - Compression and large files
    - reducing storage requirements and efficient access
  - Update
    - handling large volumes of new and modified documents
    - adding new anchor text

Large Files
- Store many documents in large files, rather than each document in a file
  - avoids overhead in opening and closing files
  - reduces seek time relative to read time
- Compound document formats
  - used to store multiple documents in a file
  - e.g., TREC Web

TREC Web Format
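The file fragment on this slide is not reproduced in the transcript. The sketch below imitates the general TREC style of compound file, with each document wrapped in <DOC> markers and carrying a <DOCNO> identifier (identifiers and contents here are invented), and shows how a reader splits such a file back into documents.

    # Several documents stored back-to-back in one large file, in the spirit of
    # the TREC Web format.
    compound_file = """<DOC>
    <DOCNO>WEB-0001</DOCNO>
    <html><head><title>Page one</title></head><body>First page text.</body></html>
    </DOC>
    <DOC>
    <DOCNO>WEB-0002</DOCNO>
    <html><head><title>Page two</title></head><body>Second page text.</body></html>
    </DOC>
    """

    # Split the large file back into its individual documents.
    for chunk in compound_file.split("</DOC>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        body = chunk.replace("<DOC>", "", 1).strip()
        docno = body[body.find("<DOCNO>") + len("<DOCNO>"): body.find("</DOCNO>")]
        print("found document", docno)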

Compression
- Text is highly redundant (or predictable)
- Compression techniques exploit this redundancy to make files smaller without losing any of the content
- Compression of indexes covered later
- Popular algorithms can compress HTML and XML text by 80%
  - e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
  - may compress large files in blocks to make access faster

BigTable
- Google's document storage system
  - Customized for storing, finding, and updating web pages
  - Handles large collection sizes using inexpensive computers

BigTable
- No query language, no complex queries to optimize
- Only row-level transactions
- Tablets are stored in a replicated file system that is accessible by all BigTable servers
- Any changes to a BigTable tablet are recorded to a transaction log, which is also stored in a shared file system
- If any tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over

BigTable
- Logically organized into rows
- A row stores data for a single web page

- Combination of a row key, a column key, and a timestamp points to a single cell in the row
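A toy illustration of that addressing scheme, using a plain Python dictionary; this is only a mental model of the (row key, column key, timestamp) triple, not Google's API, and the keys shown are invented.

    # Cells addressed by (row key, column key, timestamp).
    # Row keys are typically URLs; columns hold things such as page contents
    # and anchor text; timestamps allow several versions of the same cell.
    table = {
        ("example.com/", "contents", 1200000000): "<html>old version</html>",
        ("example.com/", "contents", 1210000000): "<html>new version</html>",
        ("example.com/", "anchor:other.com", 1210000000): "link text from other.com",
    }

    # Reading a single cell requires all three parts of the key.
    print(table[("example.com/", "contents", 1210000000)])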

BigTable
- BigTable can have a huge number of columns per row
  - all rows have the same column groups
  - not all rows have the same columns
  - important for reducing disk reads to access document data
- Rows are partitioned into tablets based on their row keys
  - simplifies determining which server is appropriate

Detecting Duplicates
- Duplicate and near-duplicate documents occur in many situations
  - copies, versions, plagiarism, spam, mirror sites
- 30% of the web pages in a large crawl are exact or near-duplicates of pages in the other 70%
- Duplicates consume significant resources during crawling, indexing, and search
- Little value to most users

Duplicate Detection
- Exact duplicate detection is relatively easy
- Checksum techniques
  - a checksum is a value that is computed based on the content of the document
  - e.g., sum of the bytes in the document file

- Possible for files with different text to have the same checksum
- Functions such as a cyclic redundancy check (CRC) have been developed that consider the positions of the bytes
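A small demonstration of the difference; the two strings are invented and contain exactly the same bytes in a different order, so a plain byte-sum checksum cannot distinguish them, while CRC-32 can.

    import zlib

    a = b"dog bites man"
    b = b"man bites dog"

    # Naive checksum: the sum of the byte values. Reordering bytes does not change it.
    print(sum(a), sum(b))                  # the same value for both -> a collision

    # CRC-32 takes byte positions into account, so the two texts get different values.
    print(zlib.crc32(a), zlib.crc32(b))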

Near-Duplicate Detection
- More challenging task
  - are web pages with the same text content but different advertising or format near-duplicates?
- A near-duplicate document is defined using a threshold value for some similarity measure between pairs of documents
  - e.g., document D1 is a near-duplicate of document D2 if more than 90% of the words in the documents are the same

Near-Duplicate Detection
- Search: find near-duplicates of a document D
  - O(N) comparisons required
- Discovery: find all pairs of near-duplicate documents in the collection
  - O(N²) comparisons
- IR techniques are effective for the search scenario
- For discovery, other techniques are used to generate compact representations

Fingerprints

Fingerprint Example

Simhash
- Similarity comparisons using word-based representations more effective at finding near-duplicates
  - problem is efficiency
- Simhash combines the advantages of the word-based similarity measures with the efficiency of fingerprints based on hashing
- Similarity of two pages as measured by the cosine correlation measure is proportional to the number of bits that are the same in the simhash fingerprints

Simhash

Simhash Example
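The worked example on this slide is not in the transcript. The sketch below follows the usual simhash recipe (weight each word by its frequency, hash every word to b bits, add or subtract the weight per bit position, keep the sign); the choice of hash function and the sample sentences are my own, not the book's.

    import hashlib
    from collections import Counter

    def simhash(text, b=64):
        """Compute a b-bit simhash fingerprint of the text."""
        weights = Counter(text.lower().split())   # feature weights = word frequencies
        totals = [0] * b
        for word, w in weights.items():
            # Hash each word to a b-bit value (md5 truncated to b bits).
            h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << b) - 1)
            for i in range(b):
                # Add the weight if bit i is set, subtract it otherwise.
                totals[i] += w if (h >> i) & 1 else -w
        # Fingerprint bit i is 1 when the weighted sum at position i is positive.
        fingerprint = 0
        for i in range(b):
            if totals[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    def same_bits(f1, f2, b=64):
        """Number of bit positions where two fingerprints agree."""
        return b - bin(f1 ^ f2).count("1")

    # Near-duplicate pages agree on most bits; unrelated pages agree on about half.
    f1 = simhash("tropical fish include fish found in tropical environments")
    f2 = simhash("tropical fish are fish found in tropical environments")
    print(same_bits(f1, f2))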

Removing Noise
- Many web pages contain text, links, and pictures that are not directly related to the main content of the page
- This additional material is mostly noise that could negatively affect the ranking of the page
- Techniques have been developed to detect the content blocks in a web page
  - non-content material is either ignored or reduced in importance in the indexing process

Noise Example

Finding Content Blocks
- Cumulative distribution of tags in the example web page
- Main text content of the page corresponds to the plateau in the middle of the distribution

Finding Content Blocks
- Represent a web page as a sequence of bits, where b_n = 1 indicates that the n-th token is a tag
- Optimization problem where we find values of i and j to maximize both the number of tags below i and above j and the number of non-tag tokens between i and j
- i.e., maximize
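The expression on the slide is missing from the transcript. Reconstructed from the description above, with N tokens and the candidate content block running from token i to token j, the quantity to maximize is:

    \sum_{n=0}^{i-1} b_n \;+\; \sum_{n=i}^{j} (1 - b_n) \;+\; \sum_{n=j+1}^{N-1} b_n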

Finding Content Blocks
- Other approaches use DOM structure and visual (layout) features