Building a Real-time, Solr-powered Recommendation Engine

Trey Grainger
Manager, Search Technology Development @ CareerBuilder

Lucene Revolution 2012 - Boston

Overview

• Overview of Search & Matching Concepts
• Recommendation Approaches in Solr:
  – Attribute-based
  – Hierarchical Classification
  – Concept-based
  – More-like-this
  – Collaborative Filtering
  – Hybrid Approaches

• Important Considerations & Advanced Capabilities @ CareerBuilder

My Background

Trey Grainger
• Manager, Search Technology Development @ CareerBuilder.com

Relevant Background
• Search & Recommendations
• High-volume, N-tier Architectures
• NLP, Relevancy Tuning, user group testing, & machine learning

Fun Side Projects
• Founder and Chief Engineer @ .com
• Currently co-authoring Solr in Action book… keep your eyes out for the early access release from Manning Publications

About Search @CareerBuilder

• Over 1 million new jobs each month
• Over 45 million actively searchable resumes
• ~250 globally distributed search servers (in the U.S., Europe, & Asia)
• Thousands of unique, dynamically generated indexes
• Hundreds of millions of search documents
• Over 1 million searches an hour

Search Products @ CareerBuilder

Redefining “Search Engine”

• “Lucene is a high-performance, full-featured text search engine library…”

Yes, but really…

• Lucene is a high-performance, fully-featured token matching and scoring library… which can perform full-text searching.

Redefining “Search Engine”

or, in machine learning speak:

• A Lucene index is a multi-dimensional sparse matrix… with very fast and powerful lookup capabilities.

• Think of each field as a matrix containing each term mapped to each document

The Lucene Inverted Index (traditional text example)

What you SEND to Lucene/Solr:

Document    Content Field
doc1        once upon a time, in a land far, far away
doc2        the cow jumped over the moon.
doc3        the quick brown fox jumped over the lazy dog.
doc4        the cat in the hat
doc5        The brown cow said “moo” once.
…           …

How the content is INDEXED into Lucene/Solr (conceptually):

Term        Documents
a           doc1 [2x]
brown       doc3 [1x], doc5 [1x]
cat         doc4 [1x]
cow         doc2 [1x], doc5 [1x]
…           …
once        doc1 [1x], doc5 [1x]
over        doc2 [1x], doc3 [1x]
the         doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…           …
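The mapping from documents to term postings can be sketched in a few lines of Python. This is only a conceptual illustration of the table on this slide, not how Lucene stores its index; the lowercase, punctuation-stripping `tokenize` helper and the function names are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# The five example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Hypothetical analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


index = build_inverted_index(docs)
for term in ("a", "brown", "the"):
    print(term, index[term])
```

Each entry plays the role of one row in the "Term / Documents" table, with the term frequency standing in for the "[2x]" annotations.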

Match Text Queries to Text Fields

/solr/select/?q=jobcontent:(software engineer)

Job Content Field    Documents
…                    …
engineer             doc1, doc3, doc4, doc5
…                    …
mechanical           doc2, doc4, doc6
…                    …
software             doc1, doc3, doc4, doc7, doc8
…                    …

[Venn diagram: “engineer” alone matches doc5; “software” alone matches doc7 and doc8; “software engineer” (both terms) matches doc1, doc3, and doc4]
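The Venn diagram on this slide amounts to set operations over posting lists. A minimal sketch, assuming the three posting lists shown in the table and a hypothetical `match` helper (real Lucene also scores the matches, which is omitted here):

```python
# Posting lists copied from the slide's "Job Content Field" table.
postings = {
    "engineer": {"doc1", "doc3", "doc4", "doc5"},
    "mechanical": {"doc2", "doc4", "doc6"},
    "software": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(terms, operator="OR"):
    """Return the documents matching any (OR) or all (AND) of the terms."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


both = match(["software", "engineer"], "AND")    # centre of the diagram
either = match(["software", "engineer"], "OR")   # everything in either circle
print(sorted(both), sorted(either))
```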

Beyond Text Searching

• Lucene/Solr is a text search matching engine
• When Lucene/Solr search text, they are matching tokens in the query with tokens in the index
• Anything that can be searched upon can form the basis of matching and scoring:
  – text, attributes, locations, results of functions, user behavior, classifications, etc.

Business Case for Recommendations

• For companies like CareerBuilder, recommendations can provide as much or even greater business value (i.e. views, sales, job applications) than user-driven search capabilities.

• Recommendations create stickiness to pull users back to your company’s website, app, etc.

• What are recommendations?… searches of relevant content for a user

Approaches to Recommendations

• Content-based
  – Attribute-based
    • e.g. income level, hobbies, location, experience
  – Hierarchical
    • e.g. “medical//nursing//oncology”, “animal//dog//terrier”
  – Textual Similarity
    • e.g. Solr’s MoreLikeThis Request Handler & Search Handler
  – Concept-based
    • e.g. Solr => “software engineer”, “java”, “search”, “open source”
• Behavior-based
  – Collaborative Filtering: “Users who liked that also liked this…”

• Hybrid Approaches

Content-based Recommendation Approaches

Attribute-based Recommendations

• Example: Match User Attributes to Item Attribute Fields

Janes_Profile:{
  Industry:"healthcare",
  Locations:"Boston, MA",
  JobTitle:"Nurse Educator",
  Salary:{ min:40000, max:60000 }
}

/solr/select/?q=(jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10) AND ((city:"Boston" AND state:"MA")^15 OR state:"MA") AND _val_:"map(salary,40000,60000,10,0)"

// By mapping the importance of each attribute to weights based upon your business domain, you can easily find results which match your customer’s profile without the user having to initiate a search.
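As a rough sketch of how such a query might be generated from a stored profile, the Python below rebuilds the slide's query string from Jane's attributes. The `attribute_query` function and the dictionary layout are assumptions for illustration; the field names and boosts are the ones shown on the slide, and a real application would also escape special characters in the profile values.

```python
from urllib.parse import urlencode

# Jane's profile from the slide, expressed as a plain dictionary.
janes_profile = {
    "Industry": "healthcare",
    "Locations": "Boston, MA",
    "JobTitle": "Nurse Educator",
    "Salary": {"min": 40000, "max": 60000},
}


def attribute_query(profile):
    """Build a boosted Solr query from profile attributes (illustrative weights)."""
    title = profile["JobTitle"].lower()
    city, state = [part.strip() for part in profile["Locations"].split(",")]
    low, high = profile["Salary"]["min"], profile["Salary"]["max"]
    return (
        f'(jobtitle:"{title}"^25 OR jobtitle:({title})^10)'
        f' AND ((city:"{city}" AND state:"{state}")^15 OR state:"{state}")'
        f' AND _val_:"map(salary,{low},{high},10,0)"'
    )


query = attribute_query(janes_profile)
# URL-encode the query before appending it to the select handler path.
url = "/solr/select/?" + urlencode({"q": query})
print(query)
```

Keeping the weights in one function makes it easy to tune them per business domain without touching the profile data.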

Hierarchical Recommendations

• Example: Match User Attributes to Item Attribute Fields

Janes_Profile:{
  MostLikelyCategory:"healthcare//nursing//oncology",
  2ndMostLikelyCategory:"healthcare//nursing//transplant",
  3rdMostLikelyCategory:"educator//postsecondary//nursing",
  …
}

/solr/select/?q=category:(
    ("healthcare.nursing.oncology"^40 OR "healthcare.nursing"^20 OR "healthcare"^10)
    OR ("healthcare.nursing.transplant"^20 OR "healthcare.nursing"^10 OR "healthcare"^5)
    OR ("educator.postsecondary.nursing"^10 OR "educator.postsecondary"^5 OR "educator")
)
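The pattern in this query (boost the most specific category most, then fall back to its ancestors with smaller boosts) can be generated mechanically. The sketch below assumes a simple halving rule per level and the hypothetical helpers `category_clause` and `hierarchical_query`; the slide's own boosts follow the same shape but leave the last "educator" term unboosted rather than at 2.5.

```python
def category_clause(path, top_boost, separator="//"):
    """Expand one category path into boosted terms for it and its ancestors."""
    parts = path.split(separator)
    terms = []
    boost = float(top_boost)
    for depth in range(len(parts), 0, -1):
        prefix = ".".join(parts[:depth])
        terms.append(f'"{prefix}"^{boost:g}')
        boost /= 2  # assumed decay: each ancestor is worth half as much
    return "(" + " OR ".join(terms) + ")"


def hierarchical_query(ranked_categories):
    """ranked_categories: list of (category path, boost for the full path)."""
    clauses = [category_clause(path, boost) for path, boost in ranked_categories]
    return "category:(" + " OR ".join(clauses) + ")"


query = hierarchical_query([
    ("healthcare//nursing//oncology", 40),
    ("healthcare//nursing//transplant", 20),
    ("educator//postsecondary//nursing", 10),
])
print(query)
```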

Textual Similarity-based Recommendations

• Solr’s More Like This Request Handler / Search Handler are a good example of this.

• Essentially, “important keywords” are extracted from one or more documents and turned into a search.

• This results in secondary search results which demonstrate textual similarity to the original document(s)

• See http://wiki.apache.org/solr/MoreLikeThis for example usage

• Currently no distributed search support (but a patch is available)
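To make the "important keywords turned into a search" idea concrete, here is a small TF-IDF sketch in Python. It is a simplification for illustration and not Solr's MoreLikeThis implementation; the corpus reuses the example documents from the inverted index slide, and `important_terms` and `mlt_query` are hypothetical names.

```python
import math
import re
from collections import Counter

corpus = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def important_terms(doc_id, documents, max_terms=3):
    """Rank a document's terms by term frequency * log(N / document frequency)."""
    doc_freq = Counter()
    for text in documents.values():
        doc_freq.update(set(tokenize(text)))
    term_freq = Counter(tokenize(documents[doc_id]))
    scores = {
        term: count * math.log(len(documents) / doc_freq[term])
        for term, count in term_freq.items()
    }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(term, score) for term, score in ranked[:max_terms] if score > 0]


def mlt_query(field, terms):
    """Turn the extracted terms into a boosted query against one field."""
    return f"{field}:(" + " OR ".join(f"{t}^{s:.2f}" for t, s in terms) + ")"


terms = important_terms("doc2", corpus)
print(mlt_query("content", terms))
```

Common words such as "the" score low because they appear in most documents, so the generated query is dominated by the rarer, more distinctive terms.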

Concept Based Recommendations

Approaches:

1) Create a Taxonomy/Dictionary to define your concepts and then either:
   a) manually tag documents as they come in
      // Very hard to scale… see Amazon Mechanical Turk if you must do this
   or
   b) create a classification system which automatically tags content as it comes in (supervised machine learning)
      // See Apache Mahout

2) Use an unsupervised machine learning algorithm to cluster documents and dynamically discover concepts (no dictionary required).
   // This is already built into Solr using Carrot2!

How Clustering Works

Setting Up Clustering in SolrConfig.xml

<searchComponent name="clustering" enable="true" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="MultilingualClustering.defaultLanguage">ENGLISH</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" enable="true" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

Clustering Search in Solr

• /solr/clustering/?q=content:nursing
    &rows=100
    &carrot.title=titlefield
    &carrot.snippet=titlefield
    &LingoClusteringAlgorithm.desiredClusterCountBase=25
    &group=false   // clustering & grouping don’t currently play nicely

• Allows you to dynamically identify “concepts” and their prevalence within a user’s top search results

Search: Nursing

Search: .Net

Example Concept-based Recommendation

Stage 1: Identify Concepts

Original Query: q=(solr OR lucene)
// can be a user’s search, their job title, a list of skills,
// or any other keyword rich data source

Clusters Identified:
  Developer (22)
  Java Developer (13)
  Software (10)
  Senior Java Developer (9)
  Architect (6)
  Software Engineer (6)
  Web Developer (5)
  Search (3)
  Software Developer (3)
  Systems (3)
  Administrator (2)
  Hadoop Engineer (2)
  Java J2EE (2)
  Search Development (2)
  Software Architect (2)
  Solutions Architect (2)

Facets Identified (occupation):
  Computer Software Engineers
  Web Developers
  ...

Example Concept-based Recommendation

Stage 2: Run Recommendations Search

q=content:("Developer"^22 OR "Java Developer"^13 OR "Software"^10 OR "Senior Java Developer"^9 OR "Architect"^6 OR "Software Engineer"^6 OR "Web Developer"^5 OR "Search"^3 OR "Software Developer"^3 OR "Systems"^3 OR "Administrator"^2 OR "Hadoop Engineer"^2 OR "Java J2EE"^2 OR "Search Development"^2 OR "Software Architect"^2 OR "Solutions Architect"^2) AND occupation:("Computer Software Engineers" OR "Web Developers")

// You can also add the user’s location or the original keywords to the
// recommendations search if it helps results quality for your use-case.
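The step from Stage 1 to Stage 2 is a mechanical translation: each cluster label becomes a phrase boosted by its document count, and the facet values become a filter on occupation. A minimal Python sketch, assuming the cluster labels and counts have already been pulled out of the clustering response (the `concept_query` name and its arguments are illustrative):

```python
def concept_query(clusters, facet_field, facet_values, content_field="content"):
    """Build the Stage 2 query from Stage 1 clusters and facets.

    clusters: list of (label, document count) pairs; the count is used as the boost.
    """
    boosted = " OR ".join(f'"{label}"^{count}' for label, count in clusters)
    facets = " OR ".join(f'"{value}"' for value in facet_values)
    return f"{content_field}:({boosted}) AND {facet_field}:({facets})"


# First few clusters and facets from the slide.
clusters = [("Developer", 22), ("Java Developer", 13), ("Software", 10)]
occupations = ["Computer Software Engineers", "Web Developers"]

query = concept_query(clusters, "occupation", occupations)
print(query)
```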

Example Concept-based Recommendation

Stage 3: Returning the Recommendations

Important Side-bar: Geography

Geography and Recommendations

• Filtering or boosting results based upon geographical area or distance can help greatly for certain use cases:
  – Jobs/Resumes, Tickets/Concerts, Restaurants
• For other use cases, location sensitivity is nearly worthless:
  – Books, Songs, Movies

/solr/select/?q=(Standard Recommendation Query) AND _val_:"(recip(geodist(location, 40.7142, -74.0064),1,1,0))"

// There are dozens of well-documented ways to search/filter/sort/boost
// on geography in Solr. This is just one example.
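To see what the distance boost does numerically, the sketch below reimplements the arithmetic in Python. Solr's `recip(x,m,a,b)` computes a/(m*x + b); the haversine distance here is a stand-in for `geodist`, and the function names and the Boston coordinates are assumptions for illustration. Note that the sketch defaults to b=1 so that a distance of zero does not divide by zero, whereas the slide's query uses b=0.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def distance_boost(user, item, m=1.0, a=1.0, b=1.0):
    """Boost that shrinks as the item gets farther from the user."""
    return recip(haversine_km(*user, *item), m, a, b)


new_york = (40.7142, -74.0064)
boston = (42.3601, -71.0589)   # assumed coordinates for a second location

print(distance_boost(new_york, new_york))  # item at the user's location
print(distance_boost(new_york, boston))    # item a few hundred km away
```

Raising `b` or lowering `m` flattens the curve, which is one way to keep nearby results from completely dominating the rest of the relevancy score.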

Behavior-based Recommendation Approaches (Collaborative Filtering)

The Lucene Inverted Index (user behavior example)

What you SEND to Lucene/Solr:

Document    “Users who bought this product” Field
doc1        user1, user4, user5
doc2        user2, user3
doc3        user4
doc4        user4, user5
doc5        user4, user1
…           …

How the content is INDEXED into Lucene/Solr (conceptually):

Term        Documents
user1       doc1, doc5
user2       doc2
user3       doc2
user4       doc1, doc3, doc4, doc5
user5       doc1, doc4
…           …

Collaborative Filtering

• Step 1: Find similar users who like the same documents

q=documentid:("doc1" OR "doc4")

Document    “Users who bought this product” Field
doc1        user1, user4, user5
doc2        user2, user3
doc3        user4
doc4        user4, user5
doc5        user4, user1
…           …

Top Scoring Results (Most Similar Users):
1) user5 (2 shared likes)
2) user4 (2 shared likes)
3) user1 (1 shared like)

[Diagram: doc1 is liked by user1, user4, and user5; doc4 is liked by user4 and user5]

Collaborative Filtering

• Step 2: Search for docs “liked” by those similar users

Most Similar Users:
1) user5 (2 shared likes)
2) user4 (2 shared likes)
3) user1 (1 shared like)

/solr/select/?q=userlikes:("user5"^2 OR "user4"^2 OR "user1"^1)

Term        Documents
user1       doc1, doc5
user2       doc2
user3       doc2
user4       doc1, doc3, doc4, doc5
user5       doc1, doc4
…           …

Top Recommended Documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)

// doc2 does not match
// above example ignores idf calculations
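The two steps can be traced end to end with a short Python sketch over the same example data. It mirrors the slide's simplification (boost = number of shared likes, no idf) and is not how Solr scores internally; the function names are illustrative.

```python
from collections import Counter

# "Users who bought this product" field from the slide.
likes = {
    "doc1": {"user1", "user4", "user5"},
    "doc2": {"user2", "user3"},
    "doc3": {"user4"},
    "doc4": {"user4", "user5"},
    "doc5": {"user4", "user1"},
}


def similar_users(liked_docs):
    """Step 1: count how many of the given docs each user also likes."""
    shared = Counter()
    for doc in liked_docs:
        for user in likes.get(doc, ()):
            shared[user] += 1
    return shared


def recommend(shared):
    """Step 2: score each doc by the summed boosts of the similar users who like it."""
    scores = Counter()
    for doc, users in likes.items():
        for user in users:
            if user in shared:
                scores[doc] += shared[user]
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


shared = similar_users(["doc1", "doc4"])
recommendations = recommend(shared)
print(dict(shared))
print(recommendations)
```

As on the slide, the documents the user already likes (doc1 and doc4) come back at the top; a production query would typically exclude them before showing recommendations.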

Lots of Variations

• Users –> Item(s)
• User –> Item(s) –> Users
• Item –> Users –> Item(s)
• etc.

Note: Just because this example tags with “users” doesn’t mean you have to. You can map any entity to any other related entity and achieve a similar result.

[Table: item-by-user matrix with rows Item 1 to Item 4 and columns User 1 to User 4, where an X marks each user associated with an item]

Comparison with Mahout

• Recommendations are much easier for us to perform in Solr:
  – Data is already present and up-to-date
  – Doesn’t require writing significant code to make changes (just changing queries)
  – Recommendations are real-time as opposed to asynchronously processed off-line
  – Allows easy utilization of any content and available functions to boost results
• Our initial tests show our collaborative filtering approach in Solr significantly outperforms our Mahout tests in terms of results quality
  – Note: We believe that some portion of the quality issues we have with the Mahout implementation have to do with staleness of data due to the frequency with which our data is updated.
• Our general takeaway:
  – We believe that Mahout might be able to return better matches than Solr with a lot of custom work, but it does not perform better for us out of the box.
• Because we already scale…
  – Since we already have all of our data indexed in Solr (tens to hundreds of millions of documents), there’s no need for us to rebuild a sparse matrix in Hadoop (your needs may be different).

Hybrid Recommendation Approaches

Hybrid Approaches

• Not much to say here, I think you get the point.

• /solr/select/?q=(category:("healthcare.nursing.oncology"^10 OR "healthcare.nursing"^5 OR "healthcare") OR title:"Nurse Educator"^15 AND _val_:"map(salary,40000,60000,10,0)"^5 AND _val_:"(recip(geodist(location, 40.7142, -74.0064),1,1,0))")

• Combining multiple approaches generally yields better overall results if done intelligently. Experimentation is key here.

Important Considerations & Advanced Capabilities @ CareerBuilder

Important Considerations @ CareerBuilder

• Payload Scoring
• Measuring Results Quality
• Understanding our Users

Custom Scoring with Payloads

• In addition to boosting search terms and fields, content within the same field can also be boosted differently using Payloads (requires a custom scoring implementation):

• Content Field:
  design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten [3] / years [3] / experience [3] / careerbuilder [2] / design [2], …

Payload Bucket Mappings:
  jobtitle:       bucket=[1]  boost=10
  company:        bucket=[2]  boost=4
  jobdescription: bucket=[]   boost=1
  experience:     bucket=[3]  boost=1.5

We can pass in a parameter to Solr at query time specifying the boost to apply to each bucket, e.g. …&bucketWeights=1:10;2:4;3:1.5;default:1;

• This allows us to map many relevancy buckets to search terms at index time and adjust the weighting at query time without having to search across hundreds of fields.

• By making all scoring parameters overridable at query time, we are able to do A/B testing to consistently improve our relevancy model.
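The bucket arithmetic can be illustrated outside of Lucene. The Python below parses a `bucketWeights` string in the format shown above and sums the weights of a term's occurrences; it is a sketch of the idea only, not the custom scoring implementation the slide refers to, and the token list and function names are assumptions for the example.

```python
def parse_bucket_weights(param):
    """Parse 'bucket:weight;bucket:weight;...' into a dictionary of floats."""
    weights = {}
    for entry in param.split(";"):
        if not entry.strip():
            continue  # tolerate the trailing ';'
        bucket, weight = entry.split(":")
        weights[bucket.strip()] = float(weight)
    return weights


def score_term(term, tokens, weights):
    """Sum the bucket weight of every occurrence of the term in the field."""
    default = weights.get("default", 1.0)
    total = 0.0
    for token, bucket in tokens:
        if token == term:
            total += weights.get(bucket, default)
    return total


# (token, payload bucket) pairs from the slide's content field; None = no payload.
tokens = [
    ("design", "1"), ("engineer", "1"), ("really", None), ("great", None),
    ("job", None), ("ten", "3"), ("years", "3"), ("experience", "3"),
    ("careerbuilder", "2"), ("design", "2"),
]

weights = parse_bucket_weights("1:10;2:4;3:1.5;default:1;")
print(score_term("design", tokens, weights))  # job title hit plus company hit
```

Because the weights arrive as a request parameter, two variants of the same query can be compared simply by sending different `bucketWeights` values.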

Measuring Results Quality

• A/B Testing is key to understanding our search results quality.

• Users are randomly divided into equal-sized groups

• Each group experiences a different algorithm for the duration of the test

• We can measure “performance” of the algorithm based upon changes in user behavior:
  – For us, more job applications = more relevant results
  – For other companies, that might translate into products purchased, additional friends requested, or non-search pages viewed

• We use this to test both keyword search results and also recommendations quality
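One common way to split users into stable groups is to hash the user id together with an experiment name, so the same user always sees the same algorithm for the duration of a test. The sketch below uses that hashing approach as a stand-in for random assignment; it is an assumption for illustration, not a description of CareerBuilder's system, and the function names are hypothetical.

```python
import hashlib


def assign_group(user_id, experiment, groups=("A", "B")):
    """Deterministically map a user to one of the test groups."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return groups[int(digest, 16) % len(groups)]


def conversion_rate(applications, users):
    """Job applications per user in a group (0.0 for an empty group)."""
    return applications / users if users else 0.0


group = assign_group("user42", "payload-weights-test")
print(group, conversion_rate(30, 1000))
```

Including the experiment name in the hash means a user's group in one test does not determine their group in the next.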

Understanding our Users (given limited information)

Understanding Our Users

• Machine learning algorithms can help us understand what matters most to different groups of users.

Example: Willingness to relocate for a job (miles per percentile)

[Chart: miles willing to relocate (y-axis, 0 to 2,500) by percentile (x-axis, 1% to 95%), with one series each for “Title Examiners, Abstractors, and Searchers”, “Software Developers, Systems Software”, and “Food Preparation Workers”]

Key Takeaways

• Recommendations can be as valuable as keyword search, or more so.

• If your data fits in Solr then you have everything you need to build an industry-leading recommendation system.

• Even a single keyword can be enough to begin making meaningful recommendations. Build up intelligently from there.

Contact Info

And yes, we are hiring – come chat with me if you are interested.

Trey Grainger
[email protected]
http://www.careerbuilder.com
@treygrainger