Lucene in Action
For ITCS 6265

Professor: Wensheng Wu
Presented by TA: Xu Fei

What is Lucene

“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”

- A high-performance, scalable Information Retrieval (IR) library
- A project of the Apache Software Foundation
- Mature, free, and open source
- Implemented in Java

Full-text indexing and searching

“In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.”

“Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”

Lucene is popular

- A number of ports or integrations to other programming languages: C/C++, C#, Ruby, Perl, Python, PHP, etc.
- 1500+ installations: HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster….

Lucene is just a hammer!

- NOT a ready-to-use search application like Google
- A software library, a toolkit
- A single compact JAR file (less than 1 MB!)
- A number of full-featured search applications have been built on top of Lucene.

What Lucene can do for you

- Add search capabilities to your application
- Index and make searchable any data that you can extract text from
- Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.
- You can even index data stored in your databases, indirectly!

Search Application

Figure 1. Typical components of a search application; the shaded components show which parts Lucene handles.

Components for indexing
- Acquire Content
- Build Document
- Analyze Document
- Index Document

Components for searching
- Search User Interface
- Build Query
- Search Query
- Render Results

Others
- Administration Interface
- Analytics Interface
- Scaleout
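The indexing and searching components listed above can be illustrated with a toy pipeline. This is only a sketch in plain Python, not Lucene's API: every function name, the sample texts, and the "all terms must match" search rule are assumptions made for illustration.

```python
import re
from collections import defaultdict

def acquire_content():
    """Acquire Content: a hypothetical in-memory source standing in for files or web pages."""
    return {
        "doc1": "Lucene is a search library.",
        "doc2": "A search application needs an index.",
    }

def build_document(doc_id, raw_text):
    """Build Document: wrap raw text in a simple field structure."""
    return {"id": doc_id, "contents": raw_text}

def analyze(text):
    """Analyze Document: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_documents(documents):
    """Index Document: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc in documents:
        for token in analyze(doc["contents"]):
            index[token].add(doc["id"])
    return index

def build_query(user_input):
    """Build Query: run the user's input through the same analyzer."""
    return analyze(user_input)

def run_query(index, terms):
    """Search Query: return ids of documents that contain every query term."""
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def render_results(doc_ids):
    """Render Results: return the matching ids in a stable order."""
    return sorted(doc_ids)

docs = [build_document(i, t) for i, t in acquire_content().items()]
index = index_documents(docs)
hits = render_results(run_query(index, build_query("search index")))
print(hits)
```

Using the same `analyze` function at index time and at query time is the design point worth noticing: a query term can only match if it is normalized the same way the indexed text was.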

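Ranking formula

$$\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Big( \mathrm{tf}(t \text{ in } D) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Big)$$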

tf–idf weight (term frequency–inverse document frequency)
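To make the factors of the ranking formula concrete, here is a small sketch that evaluates it on two toy documents. The documents are hypothetical, and the specific factor definitions (square-root term frequency, logarithmic idf, length-based norm, boost and queryNorm fixed at 1) are assumptions chosen for illustration; they are not taken from the slides.

```python
import math

# Hypothetical toy collection, already tokenized.
docs = {
    "doc1": "penn state football football".split(),
    "doc2": "football players state".split(),
}

def tf(term, doc_terms):
    # Assumed definition: square root of the raw term count.
    return math.sqrt(doc_terms.count(term))

def idf(term):
    # Assumed definition: 1 + ln(numDocs / (docFreq + 1)).
    doc_freq = sum(1 for terms in docs.values() if term in terms)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))

def coord(query, doc_terms):
    # Fraction of query terms that occur in the document.
    matched = sum(1 for t in query if t in doc_terms)
    return matched / len(query)

def norm(doc_terms):
    # Assumed definition: shorter documents get a larger norm.
    return 1.0 / math.sqrt(len(doc_terms))

def score(query, doc_id, boost=1.0, query_norm=1.0):
    # score(Q,D) = coord * queryNorm * sum over t in Q of tf * idf^2 * boost * norm
    doc_terms = docs[doc_id]
    total = sum(
        tf(t, doc_terms) * idf(t) ** 2 * boost * norm(doc_terms) for t in query
    )
    return coord(query, doc_terms) * query_norm * total

query = ["penn", "football"]
for doc_id in docs:
    print(doc_id, score(query, doc_id))
```

With this query, doc1 ranks above doc2 because it matches both terms (higher coord) and contains the rarer term "penn" (higher idf).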

Key index files in Lucene
- Segments file
- Fields information file
- Text information file
- Frequency file
- Position file

Inverted Index Example

Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting Table

Posting id  word      doc    offset
1           football  Doc 1  3
                      Doc 1  67
                      Doc 2  1
2           penn      Doc 1  1
3           players   Doc 2  2
4           state     Doc 1  2
                      Doc 2  13
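A posting table like the one in this example can be built with a few lines of code. The sketch below is plain Python rather than Lucene's on-disk format, and it uses only the visible words of the two documents: the elided text ("…") is left out, so the later occurrences at offsets 67 and 13 do not appear. The function name `build_postings` is a hypothetical helper.

```python
from collections import defaultdict

# Only the visible leading words of each document; the elided parts are omitted.
docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players",
}

def build_postings(documents):
    """Map each lowercased word to a list of (doc, offset) pairs, offsets starting at 1."""
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        for offset, word in enumerate(text.lower().split(), start=1):
            postings[word].append((doc_id, offset))
    return postings

postings = build_postings(docs)

# Assign posting ids in alphabetical word order, as in the table.
for posting_id, word in enumerate(sorted(postings), start=1):
    print(posting_id, word, postings[word])
```

Looking up a word is then a single dictionary access that returns every document and position where it occurs, which is what makes an inverted index fast compared with scanning each document.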

Demo

- How to install Lucene and run the demo
- Boolean retrieval example:
    apache -lucene
    apache +lucene
    apache lucene
- Luke: http://www.getopt.org/luke/
- An online demo (PHP + Lucene): http://tiny.cc/JCA9K
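The three Boolean queries in the demo can be mimicked with set operations over a toy index. This is a sketch, not Lucene's query parser: the index contents and the `search` helper are hypothetical, and the rules are assumptions (`+term` is required, `-term` is prohibited, a bare term is optional and combined with OR when nothing is required).

```python
# Hypothetical index: term -> set of ids of documents that contain it.
index = {
    "apache": {1, 2, 3},
    "lucene": {2, 3, 4},
}

def search(index, query):
    """Evaluate a query made of bare, +required, and -prohibited terms."""
    required, optional, prohibited = [], [], []
    for token in query.lower().split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            prohibited.append(token[1:])
        else:
            optional.append(token)

    if required:
        # Every required term must be present; optional terms do not filter.
        hits = set.intersection(*(set(index.get(t, set())) for t in required))
    else:
        # With no required terms, any optional term is enough.
        hits = set().union(*(index.get(t, set()) for t in optional))

    # Drop documents that contain a prohibited term.
    for term in prohibited:
        hits -= index.get(term, set())
    return hits

for q in ("apache -lucene", "apache +lucene", "apache lucene"):
    print(q, "->", sorted(search(index, q)))
```

In a real engine the optional term in `apache +lucene` would still influence ranking; this sketch only decides which documents match.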


Reference:

- Lucene: http://lucene.apache.org/
- Apache: http://www.apache.org/
- “Lucene in Action” Chapter 1 and code: http://www.manning.com/hatcher3/
- Lucene index: http://www.ibm.com/developerworks/library/wa-lucene/
- http://lucene.apache.org/java/2_4_0/scoring.html
- http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
- http://en.wikipedia.org/wiki/Full_text_search
- http://en.wikipedia.org/wiki/Index_(search_engine)
- http://en.wikipedia.org/wiki/Tf-idf
