Lucene in Action
For ITCS 6265
Professor: Wensheng Wu
Presented by TA: Xu Fei
What is Lucene
- “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”
- A high-performance, scalable Information Retrieval (IR) library
- A project in the Apache Software Foundation
- Mature, free, open source
- Implemented in Java
Full-text indexing and searching
- “In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.”
- “Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”
Lucene is popular
- A number of ports or integrations to other programming languages: C/C++, C#, Ruby, Perl, Python, PHP, etc.
- 1500+ installations: HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster, …
Lucene is just a hammer!
- NOT a ready-to-use search application like Google
- A software library, a toolkit
- A single compact JAR file (less than 1 MB!)
- A number of full-featured search applications have been built on top of Lucene.
What Lucene can do for you
- Add search capabilities to your application
- Index and make searchable any data that you can extract text from
- Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.
- You can even index data stored in your databases, indirectly!
Search Application
Figure 1. Typical components of search application; the shaded components show which parts Lucene handles.
Components for indexing
- Acquire Content
- Build Document
- Analyze Document
- Index Document
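To make the "Analyze Document" step more concrete, here is a minimal sketch of what an analysis pass might do before terms reach the index: lowercase the text, split it into tokens, and drop common stop words. The class name `SimpleAnalyzerSketch`, the stop-word list, and the tokenization rule are all assumptions made for illustration; this is not Lucene's own Analyzer API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stand-in for the "Analyze Document" step; not Lucene's Analyzer API. */
final class SimpleAnalyzerSketch {
    // Hypothetical stop-word list, chosen for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    /** Lowercases the text, splits on runs of non-alphanumeric characters, drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tokens that survive analysis are what the "Index Document" step would store.
        System.out.println(analyze("The History of Penn State Football"));
    }
}
```

The same analysis would normally be applied to the query text at search time, so that query terms and indexed terms are comparable.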
Components for searching
- Search User Interface
- Build Query
- Search Query
- Render Results
Others
- Administration Interface
- Analytics Interface
- Scaleout
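Ranking formula
$$\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Big( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Big)$$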
tf–idf weight (term frequency–inverse document frequency)
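As a rough illustration of how the factors in the ranking formula combine, the sketch below computes a simplified score over tokenized documents. It is a sketch under stated assumptions, not Lucene's implementation: query and term boosts and queryNorm are fixed at 1, and the concrete definitions of tf (square root of the term count), idf (1 + ln(numDocs / (docFreq + 1))), and norm (1 / sqrt(document length)) are assumed here for illustration. The class name `TfIdfScoreSketch` is hypothetical.

```java
import java.util.List;

/** Simplified tf-idf scoring sketch; boosts and queryNorm are fixed at 1. */
final class TfIdfScoreSketch {
    /** tf(t in D): square root of the raw term count (assumed definition). */
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return Math.sqrt(count);
    }

    /** idf(t): 1 + ln(numDocs / (docFreq + 1)) (assumed definition). */
    static double idf(String term, List<List<String>> corpus) {
        long docFreq = corpus.stream().filter(d -> d.contains(term)).count();
        return 1.0 + Math.log((double) corpus.size() / (docFreq + 1));
    }

    /** norm(D): 1 / sqrt(number of tokens in D), a length normalization (assumed). */
    static double norm(List<String> doc) {
        return doc.isEmpty() ? 0.0 : 1.0 / Math.sqrt(doc.size());
    }

    /** coord(Q,D): fraction of distinct query terms that occur in D. */
    static double coord(List<String> query, List<String> doc) {
        long total = query.stream().distinct().count();
        long matched = query.stream().distinct().filter(doc::contains).count();
        return total == 0 ? 0.0 : (double) matched / total;
    }

    /** coord(Q,D) times the sum over query terms of tf * idf^2 * norm. */
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double sum = 0.0;
        for (String term : query) {
            double idfValue = idf(term, corpus);
            sum += tf(term, doc) * idfValue * idfValue * norm(doc);
        }
        return coord(query, doc) * sum;
    }
}
```

Under these assumptions, a document that repeats a query term scores higher through tf, a term that appears in few documents counts more through idf, and longer documents are damped by norm.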
Key index files in Lucene
- Segments file
- Fields information file
- Text information file
- Frequency file
- Position file
Inverted Index Example
Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting table
id  word      doc    offset
1   football  Doc 1  3
              Doc 1  67
              Doc 2  1
2   penn      Doc 1  1
3   players   Doc 2  2
4   state     Doc 1  2
              Doc 2  13
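The posting table above can be sketched as a map from each word to the documents and word positions where it occurs. This is a minimal in-memory illustration, not Lucene's on-disk format; the class name `InvertedIndexSketch` and its methods are hypothetical. Because the example documents are elided with "…", the sketch indexes only the visible words, so it cannot reproduce offsets such as 67 or 13.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index; not Lucene's index format. */
final class InvertedIndexSketch {
    // word -> (document id -> 1-based word positions), sorted for readable output
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Splits the text on whitespace and records each word's position in the document. */
    void add(int docId, String text) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue; // happens only for blank input
            }
            postings.computeIfAbsent(words[i], w -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(i + 1);
        }
    }

    /** Returns document id -> positions for a word, or an empty map if it was never indexed. */
    Map<Integer, List<Integer>> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Map.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Penn State Football");
        index.add(2, "Football players State");
        System.out.println(index.lookup("football")); // {1=[3], 2=[1]}
        System.out.println(index.lookup("penn"));     // {1=[1]}
    }
}
```

Looking up a word touches only that word's postings rather than scanning every document, which is the point of inverting the document-to-word relationship.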
Demo
- How to install Lucene and run the demo
- Boolean retrieval examples: apache -lucene, apache +lucene, apache lucene
- Luke: http://www.getopt.org/luke/
- An online demo (PHP + Lucene): http://tiny.cc/JCA9K
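The three Boolean retrieval queries from the demo can be read as set operations over posting lists. The sketch below assumes the usual Lucene query-syntax reading, where `-` excludes a term, `+` marks a term as required, and two bare terms match if either occurs. The class name `BooleanRetrievalSketch`, its method names, and the document ids are made up for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Boolean retrieval as set operations; the postings below are hypothetical. */
final class BooleanRetrievalSketch {
    // term -> ids of the documents that contain it (illustrative data)
    static final Map<String, Set<Integer>> POSTINGS = Map.of(
            "apache", Set.of(1, 2, 3),
            "lucene", Set.of(2, 3, 4));

    static Set<Integer> docs(String term) {
        return POSTINGS.getOrDefault(term, Set.of());
    }

    /** "apache lucene": a document matches if it contains either term. */
    static Set<Integer> either(String a, String b) {
        Set<Integer> result = new TreeSet<>(docs(a));
        result.addAll(docs(b));
        return result;
    }

    /** "apache +lucene": the required term decides matching; the optional one would only affect ranking. */
    static Set<Integer> withRequired(String optional, String required) {
        return new TreeSet<>(docs(required));
    }

    /** "apache -lucene": documents containing the first term but not the excluded one. */
    static Set<Integer> excluding(String term, String excluded) {
        Set<Integer> result = new TreeSet<>(docs(term));
        result.removeAll(docs(excluded));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(either("apache", "lucene"));
        System.out.println(withRequired("apache", "lucene"));
        System.out.println(excluding("apache", "lucene"));
    }
}
```

A strict AND of both terms would be the intersection of the two sets instead, which in that query syntax would be written with both terms marked as required.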
Reference:
- Lucene: http://lucene.apache.org/
- Apache: http://www.apache.org/
- “Lucene in Action” Chapter 1 and code: Link
- Lucene index: http://www.ibm.com/developerworks/library/wa-lucene/
- http://lucene.apache.org/java/2_4_0/scoring.html
- http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
- http://en.wikipedia.org/wiki/Full_text_search
- http://en.wikipedia.org/wiki/Index_%28search_engine%29
- http://en.wikipedia.org/wiki/Tf-idf