![Page 1: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/1.jpg)
Apache Solr
Ramzi Alqrainy
Search Guy
Part 1
![Page 2: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/2.jpg)
What is Apache Solr?
![Page 3: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/3.jpg)
Apache Solr is
“a standalone full-text search server with Apache Lucene at the backend.”
![Page 4: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/4.jpg)
Cont.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

In brief, Apache Solr exposes Lucene's Java API as REST-like APIs that can be called over HTTP from any programming language or platform.
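To make the "callable over HTTP from any language" point concrete, here is a minimal Python sketch that builds and sends a search request. The host, port, and core name (`localhost:8983`, `collection1`) are assumptions for a local test install, not values from the slides.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed values: a local Solr server and an example core name.
SOLR_BASE = "http://localhost:8983/solr/collection1"


def build_select_url(query, rows=10, base=SOLR_BASE):
    """Return the URL for a search request against the /select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base + "/select?" + urlencode(params)


def search(query, rows=10, base=SOLR_BASE):
    """Send the request and return the matching documents (needs a running Solr)."""
    with urlopen(build_select_url(query, rows, base)) as response:
        body = json.load(response)
    return body["response"]["docs"]


if __name__ == "__main__":
    # Only prints the URL, so it runs without a server.
    print(build_select_url("wine"))
```

Any HTTP client in any language can issue the same request; only the URL and its parameters matter.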
![Page 5: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/5.jpg)
Why use Apache Solr?
![Page 6: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/6.jpg)
Features
• Full-text search
• Faceted navigation
• More items like this (recommendation) / related searches
• Spell suggest / auto-complete
• Custom document ranking/ordering
• Snippet generation/highlighting
And a lot more...
![Page 7: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/7.jpg)
Why Solr?
Also, only Solr provides:
1. Result Grouping / Field Collapsing
2. Query Elevation
3. Pivot Facet
4. Pluggable Search/Update Workflow
5. Hash-Based Deduplication
![Page 8: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/8.jpg)
Field Collapsing
“Collapses a group of results with the same field value down to a single (or fixed number) of entries.”
For example, most search engines such as Google collapse on site so only one or two entries are shown, along with a link to click to see more results from that site. Field collapsing can also be used to suppress duplicate documents.
![Page 9: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/9.jpg)
Result Grouping
“Groups documents with a common field value into groups, returning the top documents per group, and the top groups based on what documents are in the groups.”
One example is a search at Best Buy for a common term such as DVD that shows the top 3 results for each category ("TVs & Video", "Movies", "Computers", etc.).
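A grouped search like the Best Buy example can be sketched with Solr's `group`, `group.field`, and `group.limit` request parameters. The field name `category` and the sample response used to exercise the parser are assumptions for illustration.

```python
def grouping_params(query, group_field, per_group=3):
    """Build query parameters that group results by one field."""
    return {
        "q": query,
        "group": "true",             # turn result grouping on
        "group.field": group_field,  # field whose values define the groups
        "group.limit": per_group,    # top documents returned per group
        "wt": "json",
    }


def top_docs_per_group(response, group_field):
    """Map each group value to its documents, given a parsed JSON response."""
    groups = response["grouped"][group_field]["groups"]
    return {g["groupValue"]: g["doclist"]["docs"] for g in groups}


if __name__ == "__main__":
    print(grouping_params("dvd", "category"))
```

Setting `per_group=1` gives the field-collapsing behaviour described on the previous slide: one entry per distinct field value.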
![Page 10: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/10.jpg)
Query Elevation
Enables you to configure the top results for a given query regardless of the normal Lucene scoring. This is sometimes called "sponsored search", "editorial boosting" or "best bets".
![Page 11: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/11.jpg)
Pivot Facet
You can think of it as "Decision Tree Faceting", which tells you in advance what the "next" set of facet results would be for a field if you apply a constraint from the current facet results.
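The decision-tree idea can be seen by requesting a pivot with Solr's `facet.pivot` parameter and walking the nested result. The field names (`cat`, `inStock`) and the sample counts below are made up for illustration.

```python
def pivot_params(query, *fields):
    """Build parameters for a pivot facet over the given fields, in order."""
    return {
        "q": query,
        "rows": 0,                        # only the facet counts are needed
        "facet": "true",
        "facet.pivot": ",".join(fields),  # e.g. "cat,inStock"
        "wt": "json",
    }


def walk_pivot(nodes, path=()):
    """Yield (path, count) for every node of a facet_pivot tree."""
    for node in nodes:
        here = path + ((node["field"], node["value"]),)
        yield here, node["count"]
        # Child nodes hold the "next" facet counts under this constraint.
        yield from walk_pivot(node.get("pivot", []), here)


if __name__ == "__main__":
    sample = [{"field": "cat", "value": "electronics", "count": 14,
               "pivot": [{"field": "inStock", "value": True, "count": 10},
                         {"field": "inStock", "value": False, "count": 4}]}]
    for path, count in walk_pivot(sample):
        print(path, count)
```

In a real response the tree is found under `facet_counts` → `facet_pivot`, keyed by the same comma-separated field list that was sent in the request.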
![Page 12: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/12.jpg)
Pluggable Search/update Workflow
You can modify the workflow of existing API endpoints and of document inserts or updates.
![Page 13: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/13.jpg)
Hash-Based Deduplication
Determines the uniqueness of a document based not on an ID field but on the hash signature of a field.

Useful for web pages, for example, where the URL may be different but the content the same.
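The idea can be illustrated outside Solr with a few lines of Python. This is a conceptual sketch only, not Solr's implementation (in Solr the feature is configured through `SignatureUpdateProcessorFactory` in the update chain); the field names `url` and `content` and the choice of MD5 are assumptions.

```python
import hashlib


def content_signature(doc, fields=("content",)):
    """Hash the chosen fields so identical content gives an identical signature."""
    digest = hashlib.md5()
    for name in fields:
        digest.update(str(doc.get(name, "")).encode("utf-8"))
    return digest.hexdigest()


def deduplicate(docs, fields=("content",)):
    """Keep one document per signature; a later duplicate overwrites an earlier one."""
    unique = {}
    for doc in docs:
        unique[content_signature(doc, fields)] = doc
    return list(unique.values())


if __name__ == "__main__":
    pages = [
        {"url": "http://example.com/a", "content": "same text"},
        {"url": "http://example.com/b?ref=1", "content": "same text"},
    ]
    print(deduplicate(pages))
```

The two pages have different URLs but the same signature, so only one of them survives.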
![Page 14: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/14.jpg)
Boost documents by age
• Just do a descending sort by age = done?
• Boost more recent documents and penalize older documents just for being old
• Useful for news, business docs, and local search
![Page 15: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/15.jpg)
Solr: Indexing
In schema.xml:
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
<field name="pubdate" type="tdate" indexed="true" stored="true" required="true" />
In the indexing code, round the publish date to the hour:
Date published = DateUtils.round(item.getPublishedOnDate(), Calendar.HOUR);
![Page 16: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/16.jpg)
FunctionQuery Basics
• FunctionQuery: Computes a value for each document
– Ranking
– Sorting
Functions include: constant, literal, fieldvalue, ord, rord, sum, sub, product, pow, abs, log, sqrt, map, scale, query, linear, recip, max, min, ms, sqedist (squared Euclidean distance), hsin and ghhsin (Haversine formula), geohash (convert to geohash), strdist
![Page 17: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/17.jpg)
Solr: Query Time Boost
• Use the recip function with the ms function:
q={!boost b=$recency v=$qq}&recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&qq=wine
• Prefer edismax over dismax if possible:
q=wine&boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)
• Recip is a highly tunable function
– recip(x,m,a,b) implements a / (m*x + b)
– m = 3.16E-11, a = 0.08, b = 0.05, x = document age in milliseconds
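To see what these constants do, the recip formula can be evaluated directly. The sketch below mirrors a / (m*x + b) in plain Python; the helper names and the sample ages are assumptions for illustration. Note that 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, so m*x is approximately the document age in years.

```python
# Milliseconds in a year (about 3.16e10), so m = 3.16e-11 is roughly 1 / MS_PER_YEAR.
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms, m=3.16e-11, a=0.08, b=0.05):
    """Boost multiplier for a document that is age_ms milliseconds old."""
    return recip(age_ms, m, a, b)


if __name__ == "__main__":
    for label, years in [("new", 0), ("1 month", 1 / 12), ("1 year", 1), ("5 years", 5)]:
        print(f"{label:>8}: {recency_boost(years * MS_PER_YEAR):.3f}")
```

With these values a brand-new document gets 0.08 / 0.05 = 1.6, and a one-year-old document gets roughly 0.076, so the curve drops steeply in the first months and flattens afterwards. Raising b softens the advantage of the newest documents; changing m stretches or compresses the time scale.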
![Page 18: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/18.jpg)
Tune Solr recip function
![Page 19: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/19.jpg)
Tips and Tricks
• Boost should be a multiplier on the relevancy score
• {!boost b=} syntax confuses the spell checker, so you need to use spellcheck.q to be explicit:
q={!boost b=$recency v=$qq}&spellcheck.q=wine
• Bottom out the old age penalty using max, so the boost never falls below a floor:
– max(recip(…), 0.20)
• Not a one-size-fits-all solution; academic research has focused on when to apply it
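The spellcheck.q tip can be sketched as a small parameter builder that always sends the plain user query alongside the boosted one. The function name is hypothetical; the parameter names are the ones shown on the slide plus Solr's `spellcheck` switch.

```python
from urllib.parse import urlencode


def boosted_query_params(user_query, boost_function):
    """Wrap the user query in a boost and keep the spell checker on the raw text."""
    return {
        "q": "{!boost b=$recency v=$qq}",  # boosted query, dereferencing the two params below
        "recency": boost_function,
        "qq": user_query,
        "spellcheck": "true",
        "spellcheck.q": user_query,        # spell checker sees only what the user typed
    }


if __name__ == "__main__":
    params = boosted_query_params("wine", "recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)")
    print(urlencode(params))
```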
![Page 20: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/20.jpg)
Boost by Popularity
• Score based on number of unique views
• Not known at indexing time
• View count should be broken into time slots
![Page 21: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/21.jpg)
Popularity Illustrated
![Page 22: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/22.jpg)
Solr: ExternalFileField
In schema.xml:
<fieldType name="externalPopularityScore" keyField="id" defVal="1" stored="false" indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
<field name="popularity" type="externalPopularityScore" />
![Page 23: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/23.jpg)
Popularity Boost: Nuts & Bolts
• Solr Server → Logs: user activity logged
• Logs → View Counting Job → solr-home/data/external_popularity:
a=1.114
b=1.05
c=1.111
…
• commit
![Page 24: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/24.jpg)
Popularity Tips & Tricks
• For big, high-traffic sites, use log analysis
– Perfect problem for MapReduce
– Take a look at Hive for analyzing large volumes of log data
• Minimum popularity score is 1 (not zero) … up to 2 or more
– 1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth …)
• Watch out for spell checker “buildOnCommit”
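The scoring formula and the external file can be sketched together. This is an illustration under stated assumptions: each time-slot input is taken to be a view count already normalized to the range 0..1, only the three weights visible on the slide are used (the elided terms are left out), and the function names are hypothetical. The `key=value` line format matches the sample file contents shown earlier.

```python
def popularity_score(recent, last_week, last_month):
    """Combine per-slot popularity into a multiplier that starts at 1.

    Inputs are assumed to be normalized to 0..1. Only the three weights
    shown on the slide are applied; the elided terms are not guessed.
    """
    return 1 + (0.4 * recent + 0.3 * last_week + 0.2 * last_month)


def external_file_lines(scores):
    """Format {doc_id: score} as key=value lines, sorted by document id."""
    return [f"{doc_id}={score:.3f}" for doc_id, score in sorted(scores.items())]


def write_external_file(path, scores):
    """Write the lines to the given path, e.g. the external_popularity file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(external_file_lines(scores)) + "\n")


if __name__ == "__main__":
    scores = {"a": popularity_score(0.2, 0.1, 0.02), "b": popularity_score(0.0, 0.0, 0.25)}
    print("\n".join(external_file_lines(scores)))
```

Starting at 1 rather than 0 matters because the score is used as a multiplier: a document with no recorded views keeps its relevancy score instead of being zeroed out.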
![Page 25: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/25.jpg)
Filtering By User Preferences
• Easy approach is to build basic preference fields into the index:
– Content types of interest – content_type
– High-level categories of interest – category
– Source of interest – source
• We had too many categories and sources that a user could enable / disable to use basic filtering
– Custom SearchComponent with a connection to a JDBC DataSource
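The "easy approach" amounts to adding one filter query (`fq`) per preference field. Here is a minimal sketch using the three field names from the slide; the function name and the sample values are assumptions.

```python
from urllib.parse import urlencode


def preference_filters(content_types=(), categories=(), sources=()):
    """Return one fq clause per preference field that has enabled values."""
    fields = (("content_type", content_types), ("category", categories), ("source", sources))
    filters = []
    for field, values in fields:
        if values:
            # Quote each value and escape backslashes and quotes inside it.
            quoted = " OR ".join(
                '"%s"' % v.replace("\\", "\\\\").replace('"', '\\"') for v in values
            )
            filters.append(f"{field}:({quoted})")
    return filters


if __name__ == "__main__":
    fqs = preference_filters(content_types=["news"], sources=["reuters", "ap"])
    # doseq=True repeats the fq parameter once per clause.
    print(urlencode({"q": "wine", "fq": fqs}, doseq=True))
```

With hundreds of enabled values the clauses become long and expensive to send on every request, which is the limitation that motivates the custom component described next.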
![Page 26: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/26.jpg)
Preferences Component
• Connects to a database
• Caches DocIdSet in a Solr FastLRUCache
• Cached values marked as dirty using a simple timestamp passed in the request
Declared in solrconfig.xml:
<searchComponent class="demo.solr.PreferencesComponent" name="pref">
  <str name="jdbcJndi">jdbc/solr</str>
</searchComponent>
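The timestamp-based invalidation can be illustrated in a few lines. This is only a sketch of the caching idea, not the component itself, which would be a Java SearchComponent using Solr's FastLRUCache and DocIdSet; the class name, the loader callback, and the plain dictionary (no LRU eviction) are all assumptions.

```python
class PreferenceCache:
    """Cache one filter per user and reload it when the request timestamp is newer."""

    def __init__(self, loader):
        self._loader = loader  # callable(user_id) -> iterable of allowed document ids
        self._entries = {}     # user_id -> (timestamp, frozenset of allowed ids)

    def get(self, user_id, prefs_updated_at):
        """Return the cached filter, reloading if preferences changed since caching."""
        cached = self._entries.get(user_id)
        if cached is None or cached[0] < prefs_updated_at:
            # Entry is missing or dirty: hit the database again and re-cache.
            cached = (prefs_updated_at, frozenset(self._loader(user_id)))
            self._entries[user_id] = cached
        return cached[1]


if __name__ == "__main__":
    calls = []

    def fake_loader(user_id):
        calls.append(user_id)  # stands in for a JDBC query
        return [1, 2, 3]

    cache = PreferenceCache(fake_loader)
    cache.get("u1", 100)
    cache.get("u1", 100)  # served from cache
    cache.get("u1", 200)  # newer timestamp forces a reload
    print(len(calls))
```

The client passes the time its preferences were last changed with each request, so the server never needs a separate invalidation channel.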
![Page 28: Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking](https://reader030.vdocuments.net/reader030/viewer/2022020115/5495a316b47959514d8b4df2/html5/thumbnails/28.jpg)
References
• http://wiki.apache.org/solr/
• http://www.lucidworks.com/
• Apache Solr 4 Cookbook