search engine architecture 1
TRANSCRIPT
-
8/13/2019 Search Engine Architecture 1
1/23
Search Engine Architecture I
-
8/13/2019 Search Engine Architecture 1
2/23
Software Architecture
The high level structure of a softwaresystem
Software components The interfaces provided by those
components
The relationships between thosecomponents
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 2
-
8/13/2019 Search Engine Architecture 1
3/23
UIMA
An architecture to provide a standard forintegrating search and related languagetechnology components
Unstructured Information ManagementArchitecture (www.research.ibm.com/UIMA)
Defining interfaces for components tosimplify the addition of new technologiesinto systems that handle text and otherunstructured data
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 3
-
8/13/2019 Search Engine Architecture 1
4/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 4
UIMA
http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.architectureHighlights.html/$FILE/blockDiagram.gif
-
8/13/2019 Search Engine Architecture 1
5/23
A Good Reference
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 5
-
8/13/2019 Search Engine Architecture 1
6/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 6
Primary Goals of Search Engines
Effectiveness (quality): to retrieve the mostrelevant set of documents for a query Process text and store text statistics to improve
relevance
Efficiency (speed): process queries from users asfast as possible Use specialized data structures
Specific goals usually fall into the above primarygoals Example: handling changing document collections
both an effectiveness issue and an efficiency issue
-
8/13/2019 Search Engine Architecture 1
7/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 7
Two Major Functions
Search engine components support twomajor functions
The index process: building data structuresthat enable searching
The query process: using those datastructures to produce a ranked list ofdocuments for a users query
-
8/13/2019 Search Engine Architecture 1
8/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 8
The Indexing Process
-
8/13/2019 Search Engine Architecture 1
9/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 9
Text Acquisition
Identifying and making available the documentsthat will be searched
How? Crawling or scanning the web, a corporate intranet, or
other sources of information
Building a document data store containing the text andmetadata for all the documents
Metadata: document type, document structure, documentlength,
-
8/13/2019 Search Engine Architecture 1
10/23
-
8/13/2019 Search Engine Architecture 1
11/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 11
Document Feeds
A mechanism for accessing a real-timestream of documents
RSS: a common standard used for webfeeds for content such as news, blogs, orvideo
An RSS reader subscribes to RSS feeds, andprovides new content when it arrives
RSS feeds are formatted in XML
-
8/13/2019 Search Engine Architecture 1
12/23
-
8/13/2019 Search Engine Architecture 1
13/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 13
Document Data Stores
A simple database to manage large numbers ofdocuments and structured data
Document components are typically stored in acompressed form for efficiency
Structured data consists of document metadataand other information extracted from thedocuments such as links and anchor text
[Discussion] Why do we need document datastores local to search engines?
The original documents are available on the web
-
8/13/2019 Search Engine Architecture 1
14/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 14
The Indexing Process
-
8/13/2019 Search Engine Architecture 1
15/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 15
Text Transformation / Index Creation
Text transformation: transforming documents intoindex terms or features Index terms: the parts of a document that are stored in
the index and used in searching
Features: parts of a text document that is used torepresent its content
Examples: phrases, names, dates, links, Index vocabulary: the set of all the terms that are
indexed for a document collection Index creation: creating the indexes or data
structures Example: building inverted list indexes
-
8/13/2019 Search Engine Architecture 1
16/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 16
Parsers, Stopping, and Stemming
Processing the sequence of text tokens in thedocument to recognize structural elements suchas titles, figures, links, and headings Tokenization: identifying units to be indexed Using syntax of markup languages to identify structures
Stopping: removing common words from thestream of tokens, e.g., the, of, to, Reducing index size considerably
Stemming: group words that are derived from acommon stem Example: fish, fishes, fishing Increase the likelihood that words used in queries and
documents will match
-
8/13/2019 Search Engine Architecture 1
17/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 17
Other Text Transformation Tasks
Link extraction and analysis Links can be indexed separately from the general text content Using link analysis algorithms, e.g., PageRank, to quantify page
popularity and find authority pages
Using anchor text to enhance the text content of a page that the linkpoints to
Information extraction: identifying index terms that aremore complex than single words Entity identification, e.g., finding names
Classifiers: identifying class-related metadata fordocuments or parts of documents Example: finding spam documents and non-content parts of
documents (e.g., ads)
Alternatively, clustering related documents
-
8/13/2019 Search Engine Architecture 1
18/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 18
The Indexing Process
-
8/13/2019 Search Engine Architecture 1
19/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 19
Collecting Document Statistics
Gathering and recording statistical informationabout words, features, and documents Statistics will be used to compute scores of documents Stored in lookup tables
Examples Counts of index term occurrences Positions in the documents where the index terms
occurred
Counts of occurrences over groups of documents Lengths of documents in terms of the number of tokens
Actual data depends on the retrieval model andthe associated ranking method
-
8/13/2019 Search Engine Architecture 1
20/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 20
Weighting
Calculating index term weights using thedocument statistics and storing them in lookuptables Pre-computation can improve query answering
efficiency
TF/IDF weighting TF (the term frequency): the frequency of index term
occurrences in a document
IDF (inverse document frequency): the inverse of thefrequency of index term occurrences in all documents N/n (N: # documents indexed, n: # documentscontaining a particular term)
-
8/13/2019 Search Engine Architecture 1
21/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 21
Inversion and Index Distribution
Inversion: changing the stream of document-terminformation coming from the text transformationcomponent into term-document information for thecreation of inverted indexes Core of the indexing process The number of documents is large The indexes are updated with new documents from
feeds and crawls, and are often compressed for high
efficiency Index distribution: distributing indexes across
multiple computers/sites on a network Document distribution, term distribution, and replication
-
8/13/2019 Search Engine Architecture 1
22/23
J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 22
Summary
A high-level description of search enginesoftware architecture
The indexing process Building blocks and their functionalities
-
8/13/2019 Search Engine Architecture 1
23/23
To-Do List
Expand the figures of the indexing processto include the detailed functionalities
Reach Chapter 2.1-2.3.3
J Pei: Information Retrieval and Web Search -- Search Engine Architecture 23