Faceted Search and Solr


Upload: otisg

Post on 15-Jan-2015


DESCRIPTION

An overview of Faceted Search by Daniel Tunkelang and an overview of Faceted Search and Solr by Otis Gospodnetić.

TRANSCRIPT

Page 1: Faceted Search and Solr

1

New York CTO Club, December 9, 2009

Daniel Tunkelang, Google
Otis Gospodnetić, Sematext

Faceted Search

2

Agenda

Daniel:
• What is faceted search?
• Why use faceted search?
• Thoughts about design and user experience.

Otis:
• What are Lucene and Solr?
• Why use an open-source search library?
• Thoughts about implementation.

3

“Regular” Search

Interface:

• User expresses information need as short query.
• Search engine returns ranked, pageable result set.

User happy when...
• Top-ranked result satisfies information need.
• At least some result on first page is relevant.

User unhappy when...
• No result on first page satisfies information need.
• Results misleadingly appear relevant (bait and switch).

4

Relevance Is Subjective

Relevance is defined as a measure of information conveyed by a document relative to a query. It is shown that the relationship between the document and the query, though necessary, is not sufficient to determine relevance.

William Goffman, On relevance as a measure, 1964.


5

Regular Search Experience

6

Assumptions Are Dangerous

• self-awareness
• self-expression
• model knows best
• answer is a document
• one-shot query

tf-idf, PageRank

7

What is Faceted Search?

• Best understood through examples.
  - See the following slides.
  - Or shop on almost any ecommerce site.
• Facets = multiple ways to organize information.
  - Often based on available structured information.
  - But not always, e.g., facets obtained via text mining.
• Typical interaction:
  - User starts with a full-text search.
  - Facets guide query refinement process.

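The typical interaction above can be sketched in a few lines of Python. This is only an illustration, not how any particular engine works: the in-memory `DOCS` list, the field names, and the `search` and `facet_counts` helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical in-memory "index": free text plus structured fields.
DOCS = [
    {"title": "red running shoes", "category": "shoes", "brand": "Acme"},
    {"title": "blue running shorts", "category": "apparel", "brand": "Acme"},
    {"title": "trail running shoes", "category": "shoes", "brand": "Zenith"},
    {"title": "leather dress shoes", "category": "shoes", "brand": "Zenith"},
]

def search(query, filters=None):
    """Return docs whose title contains every query word and that match all filters."""
    words = query.lower().split()
    filters = filters or {}
    return [
        d for d in DOCS
        if all(w in d["title"].split() for w in words)
        and all(d[field] == value for field, value in filters.items())
    ]

def facet_counts(results, fields):
    """Count how many results carry each value of each facet field."""
    return {f: Counter(d[f] for d in results) for f in fields}

# Step 1: the user starts with a full-text search.
results = search("running")
# Step 2: facet counts over that result set suggest refinements.
counts = facet_counts(results, ["category", "brand"])
# Step 3: the user picks a facet value, which narrows the result set.
refined = search("running", {"category": "shoes"})
```

Here `counts["category"]` reports two matches under `shoes` and one under `apparel`, which is the information a faceted interface would display next to each refinement link.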
8

Faceted Search for News


9

Faceted Search for People

10

Faceted Search for Breakfast

12

But Facets are Not a Silver Bullet...

• Screen real estate is finite.
  - Choose facets wisely.
  - Choose facet values wisely for monster facets.
• Multiple selection within a facet is powerful, but...
  - Has to be intuitive, especially AND vs. OR.
  - Even trickier for hierarchical facets.
• Search relevance still matters!
  - Most faceted search applications rank results.
  - Irrelevant results → irrelevant facet refinements.

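To make the AND vs. OR point concrete, here is a minimal sketch of the two ways a multiple selection within one multivalued facet could be interpreted. The documents, the `tags` field, and the `select` helper are hypothetical.

```python
# Hypothetical documents with a multivalued "tags" facet.
DOCS = [
    {"id": 1, "tags": {"waterproof", "lightweight"}},
    {"id": 2, "tags": {"waterproof"}},
    {"id": 3, "tags": {"lightweight"}},
    {"id": 4, "tags": {"insulated"}},
]

def select(docs, field, chosen, mode):
    """Return ids of docs matching the chosen facet values under AND or OR semantics."""
    chosen = set(chosen)
    if mode == "AND":
        # Every chosen value must be present on the document.
        return [d["id"] for d in docs if chosen <= d[field]]
    if mode == "OR":
        # At least one chosen value must be present on the document.
        return [d["id"] for d in docs if chosen & d[field]]
    raise ValueError("mode must be 'AND' or 'OR'")

and_ids = select(DOCS, "tags", ["waterproof", "lightweight"], "AND")
or_ids = select(DOCS, "tags", ["waterproof", "lightweight"], "OR")
```

The same two checkboxes return one document under AND and three under OR, which is why the interface has to make the chosen semantics obvious to the user.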

13

Exploring Information Science

14

Deliver Precision and Recall

Easier said than done!

Ranking of facet values is an open research topic.

15

Be Careful with Faceted Search!

Cameras have artists?!

16

Clarify, Then Refine


17

Take-Aways

• Faceted search addresses the subjectivity of relevance and information overload.
• But deploying faceted search effectively requires that you think about user experience.
• Recommended reading:
  - My thin book entitled Faceted Search
  - Marti Hearst's book on Search User Interfaces
  - Peter Morville's upcoming book on Search Patterns

18

Otis Gospodnetić, Sematext

Faceted Search with Lucene & Solr

19

What is / isn't Lucene

• Free, ASL (Apache Software License), Java IR library, shipped as a JAR
• Doug Cutting, ASF, 2001
• Application agnostic: indexing & searching
• High performance, scalable
• No dependencies
• Heavily ported
• No: crawler, rich doc parser, turn-key solution
• No: out-of-the-box faceted search capability... but...


21

What is/isn't Solr

• Indexing/search server with HTTP API built on top of Lucene
• Fast & scalable (distributed search, index replication)
• XML, JSON, Ruby, Perl, PHP, javabin
• No: crawler (but Nutch ==> Solr works)
• Yes: rich text parser
• Yes: faceted search out of the box!

22

Solr and Faceted Search

• 3 types of facets: field values (text), dates, queries.
• "Text": return counts for all/top terms in a field for a result set, e.g. categories à la Amazon
• Dates: return counts for docs in specified date ranges
• Queries: return counts for docs that also match a given query; handy for number ranges (think prices!)

23

Facet Field Requirements

• Must be indexed
• Often not tokenized
• Often not altered (lowercase, punctuation)
• Storing not required
• Multivalued fields OK
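Why facet fields are often left untokenized and unaltered is easiest to see by counting both ways. The sketch below uses plain Python rather than a real analyzer; the lowercase-and-split step is only an assumed stand-in for what a text analysis chain might do.

```python
from collections import Counter

# Hypothetical values of a "city" field across four documents.
cities = ["New York", "New York", "New Orleans", "york"]

# Untokenized, unaltered: each whole field value is one facet term.
exact = Counter(cities)

# Tokenized and lowercased: each word becomes its own facet term.
tokenized = Counter(tok for c in cities for tok in c.lower().split())
```

With the exact values, the facet shows `New York` (2), `New Orleans` (1), and `york` (1). With tokenization, it shows `new` (3), `york` (3), and `orleans` (1), which no longer correspond to anything a user would want to click on.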

24

Turn It On

• 0 facets: http://host:80/solr/select?q=foo
• 1 facet: http://host:80/solr/select?q=foo&facet=true&facet.field=category
• N facets: http://host:80/solr/select?q=foo&facet=true&facet.field=category&facet.field=inStock
• facet=true or facet=on
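A client would normally build these URLs programmatically. The following is a minimal sketch using only the Python standard library; `facet_url` is a hypothetical helper, and the base URL is the placeholder host from the slide.

```python
from urllib.parse import urlencode

BASE = "http://host:80/solr/select"  # placeholder host taken from the slide

def facet_url(query, facet_fields=()):
    """Build a select URL, adding one facet.field parameter per requested facet."""
    params = [("q", query)]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, so a list of pairs is used instead of a dict.
        params.extend(("facet.field", f) for f in facet_fields)
    return BASE + "?" + urlencode(params)

no_facets = facet_url("foo")
two_facets = facet_url("foo", ["category", "inStock"])
```

A list of pairs is used because a dict cannot hold the same key twice, and repeating `facet.field` is exactly how several facets are requested.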


25

Text Facet Response

<result numFound="4" start="0"/>
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="category">
      <int name="electronics">3</int>
      <int name="copier">0</int>
    </lst>
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>

• facet.mincount=1 to avoid 0-count facet values
• facet.limit=N to limit to top N facet values
• facet.missing=true to catch uncategorized
• lots of other options!
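On the client side, a response like the one above can be turned into a dictionary of counts. This sketch uses Python's standard `xml.etree.ElementTree`. The enclosing `<response>` element is the root of a full Solr XML response and is added here so the fragment parses. The `field_facets` helper and its `mincount` argument are hypothetical, and they only imitate on the client what `facet.mincount` does on the server.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in a root element so that it parses.
XML = """<response>
<result numFound="4" start="0"/>
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="category">
      <int name="electronics">3</int>
      <int name="copier">0</int>
    </lst>
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>
</response>"""

def field_facets(xml_text, mincount=0):
    """Return {field: {value: count}} for every field facet in the response."""
    root = ET.fromstring(xml_text)
    fields = root.find("./lst[@name='facet_counts']/lst[@name='facet_fields']")
    facets = {}
    for field in fields.findall("lst"):
        facets[field.get("name")] = {
            value.get("name"): int(value.text)
            for value in field.findall("int")
            if int(value.text) >= mincount
        }
    return facets

all_values = field_facets(XML)
non_empty = field_facets(XML, mincount=1)
```

With `mincount=1` the zero-count `copier` value disappears from the `category` facet, mirroring the effect of `facet.mincount=1`.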

26

Date Facets

• http://.../solr/select/?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=NOW/DAY-5DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY
• (%2B is the URL-encoded plus sign: %2B1DAY ==> +1DAY)
• Solr Date Math Parser syntax: /HOUR, +2YEARS, -1DAY, /DAY+6MONTHS+3DAYS, +6MONTHS+3DAYS/DAY
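The `%2B` escaping is easy to get wrong by hand, because a literal `+` in a query string is decoded as a space. A minimal sketch with Python's standard library shows the encoding done automatically. The parameter values are the ones from the slide; the `safe="*:/"` argument is a choice made here only so the output matches the slide's URL character for character.

```python
from urllib.parse import urlencode, parse_qs

params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("facet", "true"),
    ("facet.date", "timestamp"),
    ("facet.date.start", "NOW/DAY-5DAYS"),
    ("facet.date.end", "NOW/DAY+1DAY"),
    ("facet.date.gap", "+1DAY"),
]

# "+" is percent-encoded as %2B; "*", ":" and "/" are left readable via safe=.
query_string = urlencode(params, safe="*:/")

# Decoding the query string recovers the original "+" characters.
decoded = parse_qs(query_string)
```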

27

Date Facet Response

<result name="response" numFound="42" start="0"/>
<lst name="facet_counts">
  <lst name="facet_dates">
    <lst name="timestamp">
      <int name="2007-08-11T00:00:00.000Z">1</int>
      <int name="2007-08-12T00:00:00.000Z">5</int>
      <int name="2007-08-13T00:00:00.000Z">3</int>
      <int name="2007-08-14T00:00:00.000Z">7</int>
      <int name="2007-08-15T00:00:00.000Z">2</int>
      <int name="2007-08-16T00:00:00.000Z">16</int>
      <str name="gap">+1DAY</str>
      <date name="end">2007-08-17T00:00:00Z</date>
    </lst>
  </lst>
</lst>
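In the date facet response, the per-day counts (`int` elements) sit next to metadata such as `gap` and `end`, so a client has to separate the two. The sketch below is one way to do that with the standard library. The `<response>` root and the closing tags are added so the fragment parses, and `date_buckets` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

XML = """<response>
<result name="response" numFound="42" start="0"/>
<lst name="facet_counts">
  <lst name="facet_dates">
    <lst name="timestamp">
      <int name="2007-08-11T00:00:00.000Z">1</int>
      <int name="2007-08-12T00:00:00.000Z">5</int>
      <int name="2007-08-13T00:00:00.000Z">3</int>
      <int name="2007-08-14T00:00:00.000Z">7</int>
      <int name="2007-08-15T00:00:00.000Z">2</int>
      <int name="2007-08-16T00:00:00.000Z">16</int>
      <str name="gap">+1DAY</str>
      <date name="end">2007-08-17T00:00:00Z</date>
    </lst>
  </lst>
</lst>
</response>"""

def date_buckets(xml_text, field):
    """Return ([(date, count), ...], gap) for one date facet field."""
    root = ET.fromstring(xml_text)
    node = root.find(
        f"./lst[@name='facet_counts']/lst[@name='facet_dates']/lst[@name='{field}']"
    )
    # Only the <int> children are buckets; <str> and <date> are metadata.
    buckets = [
        (datetime.strptime(i.get("name"), "%Y-%m-%dT%H:%M:%S.%fZ").date(), int(i.text))
        for i in node.findall("int")
    ]
    gap = node.find("str[@name='gap']").text
    return buckets, gap

buckets, gap = date_buckets(XML, "timestamp")
in_range = sum(count for _, count in buckets)
```

The six buckets sum to 34, fewer than the 42 documents found, because documents whose timestamp falls outside the requested range are not counted in any bucket.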

28

Query Facets

• http://.../solr/select?q=shoes&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
• Avoids the bucket-at-index-time work-around
• Keep queries disjoint (square brackets are inclusive, so a price of exactly 500 is counted by both queries above)
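One way to keep the range queries disjoint is to generate them from a list of boundaries. This is a sketch under a loud assumption: prices are whole numbers, so an inclusive upper bound of `b - 1` is equivalent to "less than `b`". `price_buckets` is a hypothetical helper, and `price` is the field name from the slide.

```python
def price_buckets(boundaries):
    """Return disjoint, inclusive range queries over integer prices."""
    queries = []
    lower = "*"
    for b in sorted(boundaries):
        # Inclusive upper bound b - 1 keeps b itself out of this bucket.
        queries.append(f"price:[{lower} TO {b - 1}]")
        lower = b
    queries.append(f"price:[{lower} TO *]")
    return queries

two_buckets = price_buckets([500])
three_buckets = price_buckets([100, 500])

# Each query would be sent as its own facet.query parameter.
params = [("q", "shoes"), ("rows", "0"), ("facet", "true")]
params += [("facet.query", q) for q in two_buckets]
```

With disjoint buckets the facet counts add up to at most the number of matching documents, which makes them much easier to explain in a user interface.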


29

Query Facet Response

<result numFound="3" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="price:[* TO 500]">3</int>
    <int name="price:[500 TO *]">1</int>
  </lst>
  <lst name="facet_fields">
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>

30

UI Integration

• Use filter queries via fq
• http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0+TO+300]
• http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0+TO+300]&fq=inStock:true
• Important: single request does it all
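In a UI, each facet value the user clicks typically becomes one more `fq` parameter on the next request. The sketch below shows that mapping using the standard library. `refine_url` is a hypothetical helper, the host is the placeholder from the earlier slide, and values containing special query-syntax characters would need escaping that is not shown here.

```python
from urllib.parse import urlencode, parse_qsl

BASE = "http://host:80/solr/select"  # placeholder host, as on the earlier slide

def refine_url(query, facet_fields, selections):
    """Build one request carrying the query, the facets to count, and all active filters."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    # One fq per selected refinement; the filters are combined with AND.
    params += [("fq", f"{field}:{value}") for field, value in selections]
    return BASE + "?" + urlencode(params)

url = refine_url(
    "shoes",
    ["category"],
    [("price", "[0 TO 300]"), ("inStock", "true")],
)

# Decode the query string again to inspect the filters that were sent.
filters = [v for k, v in parse_qsl(url.split("?", 1)[1]) if k == "fq"]
```

Because the query, the filters, and the facet request travel together, the response contains both the narrowed result list and the facet counts for the next round of refinement.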

31

State of Lucene & Solr

• Super healthy community, exploding development
• Lucene 3.0 (2009-11-25):
  - Performance, faster range queries, clean API, better Unicode support, more non-English support
• Solr 1.4 (2009-11-10):
  - Performance, new replication, DB indexing, rich-doc indexing, results clustering, faster response protocol, deduplication...

32

Lucene, Solr, Enterprise

• Free: community
  - Lucene: ~600 emails/month (dev: 2000/month)
  - Solr: ~1300 emails/month (dev: 800/month)
• Commercial: support subscriptions
  - Sematext
  - Lucid Imagination