For ITCS 6265. Professor: Wensheng Wu. Presented by TA: Xu Fei
![Page 1: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/1.jpg)
Lucene in Action
For ITCS 6265
Professor: Wensheng Wu
Presented by TA: Xu Fei
![Page 2: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/2.jpg)
What is Lucene
“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”
- A high-performance, scalable Information Retrieval (IR) library
- A project in the Apache Software Foundation
- Mature, free, open-source
- Implemented in Java
![Page 3: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/3.jpg)
Full-text indexing and searching
“In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.”
“Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”
![Page 4: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/4.jpg)
Lucene is popular
- A number of ports or integrations to other programming languages: C/C++, C#, Ruby, Perl, Python, PHP, etc.
- 1500+ installations: HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster, …
![Page 5: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/5.jpg)
Lucene is just a hammer!
- NOT a ready-to-use search application like Google
- A software library, a toolkit
- A single compact JAR file (less than 1 MB!)
- A number of full-featured search applications have been built on top of Lucene.
![Page 6: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/6.jpg)
What Lucene can do for you
- Add search capabilities to your application
- Index and make searchable any data that you can extract text from
- Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.
- You can even index data stored in your databases, indirectly!
![Page 7: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/7.jpg)
Search Application
Figure 1. Typical components of a search application; the shaded components show which parts Lucene handles.
Components for indexing
- Acquire Content
- Build Document
- Analyze Document
- Index Document
Components for searching
- Search User Interface
- Build Query
- Search Query
- Render Results
Others
- Administration Interface
- Analytics Interface
- Scaleout
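The indexing and searching components listed above can be sketched as a toy pipeline. This is only an illustration in plain Python, not Lucene's API: the function names, the `STOP_WORDS` set, and the "all query terms must match" rule are assumptions made for the example.

```python
import re

# Hypothetical stop-word list; a real analyzer would be configurable.
STOP_WORDS = {"a", "an", "the", "of", "and"}


def build_document(title, body):
    """Build Document: wrap raw content as named text fields."""
    return {"title": title, "body": body}


def analyze(text):
    """Analyze Document: lowercase, split into tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def index_documents(docs):
    """Index Document: map each term to the set of document ids containing it."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for field_text in doc.values():
            for term in analyze(field_text):
                index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Build Query + Search Query: analyze the query, require every term."""
    terms = analyze(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


def render(docs, hits):
    """Render Results: show the title of each matching document."""
    return [docs[i]["title"] for i in hits]
```

Note that the same `analyze` function is applied at indexing time and at query time, so a query such as "the Football" is reduced to the single term `football` before lookup.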
![Page 8: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/8.jpg)
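Ranking formula
$$\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Big( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Big)$$
tf–idf weight (term frequency–inverse document frequency)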
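To make the factors in the ranking formula concrete, here is a minimal Python sketch. The slide does not define the individual factors, so the definitions below are assumptions modeled on the default similarity described on the scoring page listed in the references: tf as the square root of the term frequency, idf as 1 + ln(numDocs / (docFreq + 1)), coord as the fraction of query terms matched, queryNorm as 1 / sqrt of the summed squared weights, and norm(D) as pure length normalization with all boosts left at 1.

```python
import math
from collections import Counter


def tf(freq):
    # Assumed: square root of the term's frequency in the document.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Assumed: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query, doc, corpus, boosts=None):
    """Score one tokenized document against a list of query terms.

    `corpus` is a list of token lists and is assumed to contain `doc`.
    `boosts` maps a term to its boost (default 1.0).
    """
    boosts = boosts or {}
    query = list(dict.fromkeys(query))  # unique terms, original order
    num_docs = len(corpus)
    counts = Counter(doc)
    matched = [t for t in query if counts[t] > 0]
    if not matched:
        return 0.0

    doc_freq = {t: sum(1 for d in corpus if t in d) for t in query}
    coord = len(matched) / len(query)
    sum_sq = sum((idf(doc_freq[t], num_docs) * boosts.get(t, 1.0)) ** 2
                 for t in query)
    query_norm = 1.0 / math.sqrt(sum_sq)
    norm = 1.0 / math.sqrt(len(doc))  # assumed: length normalization only

    total = sum(
        tf(counts[t]) * idf(doc_freq[t], num_docs) ** 2
        * boosts.get(t, 1.0) * norm
        for t in matched
    )
    return coord * query_norm * total
```

With this sketch, a document that matches every query term gets coord = 1, while one that matches only half of them is scaled by 0.5, which is how the formula rewards documents covering more of the query.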
![Page 9: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/9.jpg)
Key index files in Lucene
- Segments file
- Fields information file
- Text information file
- Frequency file
- Position file
![Page 10: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/10.jpg)
Inverted Index Example
Doc 1: Penn State Football … football
Doc 2: Football players … State
Posting Table

| id | word | doc | offset |
|---|---|---|---|
| 1 | football | Doc 1 | 3 |
| | | Doc 1 | 67 |
| | | Doc 2 | 1 |
| 2 | penn | Doc 1 | 1 |
| 3 | players | Doc 2 | 2 |
| 4 | state | Doc 1 | 2 |
| | | Doc 2 | 13 |
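A posting table like the one above can be built with a few lines of Python. This is a sketch under stated assumptions: offsets are 1-based word positions, and the example feeds in only the words visible on the slide, so the positions that depend on the elided text (67 and 13) are not reproduced.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to a list of (doc_id, word_position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, position))
    # Sort by word so the table reads alphabetically, as on the slide.
    return dict(sorted(index.items()))


docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players State",
}
index = build_index(docs)
# index["football"] is [("Doc 1", 3), ("Doc 2", 1)]
```

Looking up a word is then a single dictionary access, which is what makes the inverted index faster than scanning every stored document at query time.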
![Page 11: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/11.jpg)
Demo
- How to install Lucene and run the demo
- Boolean retrieval examples: apache – lucene; apache + lucene; apache lucene
- Luke: http://www.getopt.org/luke/
- An online demo (PHP + Lucene): http://tiny.cc/JCA9K
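The three Boolean retrieval examples can be read as set operations on posting lists. The sketch below assumes the slide intends the usual query operators, written without a space: `-term` marks a prohibited term, `+term` a required term, and a bare term is optional (OR). The function name and the sample postings are made up for illustration.

```python
def boolean_search(query, postings):
    """Evaluate a query of optional, +required, and -prohibited terms.

    `postings` maps a term to the set of document ids that contain it.
    """
    required, optional, prohibited = [], [], []
    for token in query.split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            prohibited.append(token[1:])
        else:
            optional.append(token)

    if required:
        # Every required term must be present; optional terms do not filter.
        result = set.intersection(*(set(postings.get(t, ())) for t in required))
    else:
        # With no required terms, any optional term is enough.
        result = set().union(*(postings.get(t, ()) for t in optional))

    for term in prohibited:
        result -= set(postings.get(term, ()))
    return result


# Hypothetical postings: document ids containing each term.
postings = {"apache": {1, 2, 3}, "lucene": {2, 3, 4}}
# boolean_search("apache -lucene", postings) gives {1}
# boolean_search("apache +lucene", postings) gives {2, 3, 4}
# boolean_search("apache lucene", postings) gives {1, 2, 3, 4}
```

In a real engine the optional term in "apache +lucene" still affects ranking, so documents containing both words would be listed first; this sketch only computes which documents match.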
![Page 12: For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei](https://reader035.vdocuments.net/reader035/viewer/2022072006/56649d025503460f949d5f0a/html5/thumbnails/12.jpg)
Reference:
- Lucene: http://lucene.apache.org/
- Apache: http://www.apache.org/
- “Lucene in Action” Chapter 1 and code: Link
- Lucene index: http://www.ibm.com/developerworks/library/wa-lucene/
- http://lucene.apache.org/java/2_4_0/scoring.html
- http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
- http://en.wikipedia.org/wiki/Full_text_search
- http://en.wikipedia.org/wiki/Index_%28search_engine%29
- http://en.wikipedia.org/wiki/Tf-idf