Introduction to Search Engines
![Page 1: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/1.jpg)
ENTERPRISE SEARCH
an introduction
![Page 2: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/2.jpg)
Web Search
Desktop Search
Enterprise Search
![Page 3: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/3.jpg)
so what is a
Search Engine?
![Page 4: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/4.jpg)
a SOFTWARE
• that builds an index on text
• and answers queries using that index
![Page 5: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/5.jpg)
Any search application has two major components:
• SEARCH component: of importance to the users
• INDEXING component: of importance to us developers (read: headache)
![Page 6: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/6.jpg)
[Diagram] INDEXING component: data is indexed into the INDEX FILES. SEARCH component: the user sends a search query and receives search results from the INDEX FILES.
![Page 7: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/7.jpg)
Let’s start with
INDEXING
![Page 8: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/8.jpg)
is it easy to search here . . .
![Page 9: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/9.jpg)
or here . . .
![Page 10: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/10.jpg)
• that's information dumped like garbage
• no structure
• comes in all kinds of shapes, sizes, formats
![Page 11: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/11.jpg)
• And this is what indexing does:
• it puts data into a structured format that is easily accessible through search
![Page 12: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/12.jpg)
so what needs to be
Indexed and Searched?
![Page 13: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/13.jpg)
various FILE FORMATS
Text Files
HTML
PDF
MS Word
PPT
![Page 14: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/14.jpg)
coming from various DATA SOURCES
Emails
CMS
File System
Database
Web Pages
![Page 15: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/15.jpg)
data (documents)
text that should be indexed is fed to the Analyzer, which applies:
• removing stop words such as "a" or "the"
• converting all text to lowercase letters for case-insensitive searching
• stemming (a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish")
tokenized text is passed to the Index Writer, which writes the INDEX FILES
user sends search query, receives search results
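The three analyzer steps listed on this slide can be sketched in a few lines of Python. This is a toy illustration, not the analyzer of any particular search library: the stop-word list and the suffix-stripping `stem` function are assumptions made for the example, and a real stemmer (Porter, for instance) is considerably more careful.

```python
import re

# Assumption: a tiny stop-word list chosen for this example only.
STOP_WORDS = {"a", "an", "the", "of", "is"}

# Assumption: naive suffix stripping, far cruder than a real stemmer.
SUFFIXES = ("ing", "ed", "er", "s")


def stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase the text, split it into tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(analyze("The Fisher is fishing"))
```

The output of `analyze` is the "tokenized text" that would be handed to an index writer.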
![Page 16: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/16.jpg)
Document 1: Coffee isn't my cup of tea.
Document 2: Chocolate, men, coffee - some things are better rich.
INDEX
coffee - 1, 2
cup - 1
tea - 1
chocolate - 2
men - 2
things - 2
better - 2
rich - 2
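An index of this kind is usually called an inverted index: each term maps to the numbers of the documents that contain it. A minimal Python sketch for the two sample documents follows. The stop-word set is an assumption, picked so that the same words are dropped as on the slide, and no stemming is applied.

```python
import re
from collections import defaultdict

# Assumption: stop words chosen to match the words omitted on the slide.
STOP_WORDS = {"isn't", "my", "of", "some", "are"}


def build_index(documents):
    """Map each non-stop-word term to the sorted list of document numbers containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z']+", text.lower()):
            if term not in STOP_WORDS:
                postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


def search(index, term):
    """Answer a single-term query straight from the index."""
    return index.get(term.lower(), [])


if __name__ == "__main__":
    docs = {
        1: "Coffee isn't my cup of tea.",
        2: "Chocolate, men, coffee - some things are better rich.",
    }
    index = build_index(docs)
    for term, doc_ids in index.items():
        print(term, "-", doc_ids)
```

Answering a query is then a dictionary lookup rather than a scan over every document, which is the point of indexing.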
![Page 17: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/17.jpg)
And now the
SEARCH Component
![Page 18: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/18.jpg)
[Diagram] data is indexed into the INDEX FILES; the user sends a search query, its search terms are looked up in the INDEX FILES, and the user receives search results.
![Page 19: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/19.jpg)
Search Request Terms
• Spelling Index: Correct Search Terms + Incorrect Search Terms
• Taxonomy: Search Terms + Related Terms from Taxonomy + Concept IDs
• Search engine (INDEX)
• Search results with: 1) Actual Location of the result 2) Rank 3) Details 4) Facet Categorization
• Results' Page
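The first two stages of this flow, spelling correction and taxonomy expansion, can be sketched as follows. Everything here is hypothetical: the vocabulary, the taxonomy entries, and the function names are invented for illustration, and the fuzzy matching from Python's standard `difflib` module stands in for a real spelling index.

```python
import difflib

# Assumption: a toy vocabulary standing in for the spelling index.
VOCABULARY = ["coffee", "tea", "chocolate", "cup"]

# Assumption: a toy taxonomy mapping a term to related terms.
TAXONOMY = {
    "coffee": ["espresso", "latte"],
    "tea": ["chai"],
}


def correct_term(term):
    """Return the closest vocabulary word, or the term unchanged if none is close."""
    matches = difflib.get_close_matches(term.lower(), VOCABULARY, n=1, cutoff=0.6)
    return matches[0] if matches else term.lower()


def expand_query(query):
    """Correct each search term, then append related terms from the taxonomy."""
    expanded = []
    for term in query.split():
        corrected = correct_term(term)
        expanded.append(corrected)
        expanded.extend(TAXONOMY.get(corrected, []))
    return expanded


if __name__ == "__main__":
    print(expand_query("cofee"))
```

The expanded term list is what would then be sent to the search engine's index.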
![Page 20: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/20.jpg)
introducing
LUCENE
![Page 21: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/21.jpg)
Full-text search library
Open Source
Documents are sets of named fields (XML is how documents are submitted through Solr)
Can operate on its own or via Solr
![Page 22: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/22.jpg)
![Page 23: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/23.jpg)
![Page 24: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/24.jpg)
Ways of storing the fields of a document:
Indexed: the field is searchable.
Stored: the content can be displayed in the search results. You may choose to store a field without making it searchable. Example: "summary associated with a page".
Tokenized: the field is run through an Analyzer, which converts the content into a sequence of tokens.
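The three options are independent of one another, which a small Python model makes concrete. This is not Lucene's API: `FieldSpec`, `index_terms`, and `display_value` are hypothetical names, and whitespace splitting stands in for a real Analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of how one field is handled."""
    name: str
    indexed: bool    # searchable
    stored: bool     # kept so it can be shown in the search results
    tokenized: bool  # run through an analyzer before indexing


def index_terms(spec, value):
    """Terms that would go into the index for this field value."""
    if not spec.indexed:
        return []
    if spec.tokenized:
        return value.lower().split()  # stand-in for a real Analyzer
    return [value]  # indexed as a single, exact term


def display_value(spec, value):
    """Value available for display in results, or None if the field is not stored."""
    return value if spec.stored else None


if __name__ == "__main__":
    title = FieldSpec("title", indexed=True, stored=True, tokenized=True)
    summary = FieldSpec("summary", indexed=False, stored=True, tokenized=False)
    print(index_terms(title, "Enterprise Search"))
    print(index_terms(summary, "A short summary"))
    print(display_value(summary, "A short summary"))
```

A summary that is stored but not indexed can be shown next to a hit without ever being matched by a query.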
![Page 25: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/25.jpg)
introducing
SOLR
[Diagram] layers, top to bottom: Solr, Lucene, Index
![Page 26: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/26.jpg)
• open source
• handles index and query requests to Lucene via HTTP and XML (also JSON)
• manages document update, add and delete requests to Lucene
• straightforward schema and config files
• comprehensive HTML Admin Interfaces
• highly configurable
![Page 27: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/27.jpg)
Adding Documents to SOLR
![Page 28: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/28.jpg)
HTTP POST to /update
<add>
  <doc boost="2">
    <field name="type">05991</field>
    <field name="from">Apache Solr</field>
    <field name="subject">An intro...</field>
    <field name="category">search</field>
    <field name="category">lucene</field>
    <field name="body">Solr is a full...</field>
  </doc>
</add>
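As a rough sketch, the same add message can be built and posted from Python using only the standard library. The helper names `build_add_xml` and `post_update` are hypothetical, and the full URL is an assumption that combines the `/update` path with the `http://localhost:8983/solr` base used in the query examples in these slides. The `boost` attribute is copied from the slide; newer Solr releases may not accept index-time boosts.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(fields, boost=None):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs.

    A field name may repeat, as "category" does on the slide.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def post_update(xml_body, url="http://localhost:8983/solr/update"):
    """POST an XML update message; the URL is an assumption, adjust it to your setup."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    body = build_add_xml(
        [("type", "05991"), ("category", "search"), ("category", "lucene")],
        boost=2,
    )
    print(body)  # call post_update(body) only when a Solr server is running
```

Building the message with an XML library rather than string concatenation keeps field values containing `<` or `&` correctly escaped.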
![Page 29: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/29.jpg)
schema.xml: defines how fields are indexed and displayed
![Page 30: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/30.jpg)
solrconfig.xml: defines cache sizes, faceted field types, and request handler customization
![Page 31: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/31.jpg)
Deleting Documents
• Delete by Id
<delete><id>05591</id></delete>
• Delete by Query (multiple documents)
<delete>
  <query>manufacturer:microsoft</query>
</delete>
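Both delete messages are small enough to generate with a couple of helper functions. The sketch below only builds the XML strings; the function names are hypothetical, and the resulting message would be sent with an HTTP POST in the same way as an add message.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a message that deletes the single document with this id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a message that deletes every document matching the query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


if __name__ == "__main__":
    print(delete_by_id("05591"))
    print(delete_by_query("manufacturer:microsoft"))
```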
![Page 32: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/32.jpg)
Search Results
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
![Page 33: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/33.jpg)
Default Parameters

| param | default | description |
|-------|---------|-------------|
| q | | The query |
| start | 0 | Offset into the list of matches |
| rows | 10 | Number of documents to return |
| fl | * | Stored fields to return |
| qt | standard | Query type; maps to query handler |
| df | (schema) | Default field to search |
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
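A URL like this one can be assembled from its parameters with Python's standard `urllib.parse`. The helper `build_select_url` is a hypothetical name, and its defaults simply mirror the values listed on this slide; it builds the URL without sending a request.

```python
from urllib.parse import urlencode


def build_select_url(query, start=0, rows=10, fl="*",
                     base="http://localhost:8983/solr/select"):
    """Build a select URL from the q, start, rows and fl parameters."""
    params = {"q": query, "start": start, "rows": rows, "fl": fl}
    # Keep "," and "*" unescaped so that field lists stay readable.
    return f"{base}?{urlencode(params, safe=',*')}"


if __name__ == "__main__":
    print(build_select_url("video", rows=2, fl="name,price"))
```

Letting `urlencode` do the escaping matters once a query contains spaces or characters such as `&`.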
![Page 34: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/34.jpg)
<response>
  <responseHeader>
    <status>0</status>
    <QTime>1</QTime>
  </responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
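A response in this shape can be read with Python's standard XML parser. The sketch assumes exactly the structure shown on this slide: a `result` element with a `numFound` attribute, containing `doc` elements whose children carry a `name` attribute. The function name `parse_results` is hypothetical, and values are left as strings rather than converted by type.

```python
import xml.etree.ElementTree as ET


def parse_results(xml_text):
    """Return (total number of matches, list of documents as name-to-text dicts)."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = [
        {child.get("name"): child.text for child in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


if __name__ == "__main__":
    sample = (
        '<response><responseHeader><status>0</status><QTime>1</QTime></responseHeader>'
        '<result numFound="16173" start="0">'
        '<doc><str name="name">Apple 60 GB iPod with Video</str>'
        '<float name="price">399.0</float></doc>'
        '</result></response>'
    )
    total, docs = parse_results(sample)
    print(total, docs)
```

Here `numFound` is the total number of matches, while the number of returned `doc` elements is limited by the `rows` parameter.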
![Page 35: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/35.jpg)
[Diagram] Solr architecture, component labels:
• Servlets: HTTP Request Servlet, Update Servlet
• Request handlers: Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler
• Interfaces: Admin Interface, XML Update Interface, XML Response Writer
• Solr Core: Config, Schema, Analysis, Caching, Concurrency, Update Handler, Replication
• Lucene
• Annotations: "Search requests hit here", "New document to be added here"
![Page 36: Introduction to Search Engines](https://reader035.vdocuments.net/reader035/viewer/2022062404/554b499db4c905b5378b52c0/html5/thumbnails/36.jpg)