Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks


Upload: lucidworks

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Reflected Intelligence: Lucene/Solr as a self-learning data system

Trey Grainger

SVP of Engineering, Lucidworks

October 11-14, 2016, Boston, MA

Page 2: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

About Me

Trey Grainger, SVP of Engineering

• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search – Stanford University

Fun outside of CB:
• Co-author of Solr in Action, plus a handful of research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor

Page 3: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Overview

Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, disambiguation, concept expansion, rules)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning (reflected intelligence)

Page 4: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Key Technologies

• Keyword Search
- Lucene/Solr

• Taxonomies / Entity Extraction
- Solr Text Tagger
- Word2Vec / Dice Conceptual Search
- SolrRDF

• Query Intent
- Probabilistic Query Parser (SOLR-9418)
- Semantic Knowledge Graph (SOLR-9480)

• Relevancy Tuning
- Solr Learning to Rank Plugin (SOLR-8542)

• General Needs: a solid log processing framework (Apache Spark, Lucidworks Fusion, or Solr Daemon Expression)

Page 5: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

what is “reflected intelligence”?

Page 6: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

The Three C’s

Content: Keywords and other features in your documents

Collaboration: How others have chosen to interact with your system

Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

Page 7: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Feedback Loops

[Figure: a cycle of three steps: User searches, User sees results, User takes an action.]

Users’ actions inform system improvements

Page 8: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Examples of Reflected Intelligence

● Recommendation Engines
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content

Page 9: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Overview

Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, disambiguation, concept expansion, rules)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning (reflected intelligence)

Page 10: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Basic Keyword Search

The beginning of a typical search journey

Page 11: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

The inverted index

What you SEND to Lucene/Solr:

Document    Content Field
doc1        once upon a time, in a land far, far away
doc2        the cow jumped over the moon.
doc3        the quick brown fox jumped over the lazy dog.
doc4        the cat in the hat
doc5        The brown cow said “moo” once.
…           …

How the content is INDEXED into Lucene/Solr (conceptually):

Term        Documents
a           doc1 [2x]
brown       doc3 [1x], doc5 [1x]
cat         doc4 [1x]
cow         doc2 [1x], doc5 [1x]
…           …
once        doc1 [1x], doc5 [1x]
over        doc2 [1x], doc3 [1x]
the         doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…           …
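As a rough illustration of the idea on this slide, the following sketch builds a conceptual inverted index (term to documents, with term counts) from the five example documents. The regex tokenizer is an assumption standing in for a real Lucene analyzer, and `build_inverted_index` is a hypothetical helper name, not a Lucene/Solr API.

```python
import re
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Lowercase and keep runs of letters only; a simplified stand-in
        # for the analysis chain a real search engine would apply.
        terms = re.findall(r"[a-z]+", text.lower())
        for term, freq in Counter(terms).items():
            index[term][doc_id] = freq
    return dict(index)


docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": "The brown cow said “moo” once.",
}

index = build_inverted_index(docs)
for term in ("a", "brown", "the"):
    print(term, index[term])
```

The printed postings for "a", "brown", and "the" line up with the rows shown in the slide's table.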

Page 12: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Matching queries to documents

/solr/select/?q=apache solr

Term        Documents
…           …
apache      doc1, doc3, doc4, doc5
hadoop      doc2, doc4, doc6
…           …
solr        doc1, doc3, doc4, doc7, doc8
…           …

[Figure: Venn diagram of the "apache" and "solr" postings. apache only: doc5. solr only: doc7, doc8. Both (apache solr): doc1, doc3, doc4.]
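The Venn diagram on this slide boils down to set operations over postings lists. A minimal sketch, assuming the postings are held in plain Python sets (the `match` helper and its `operator` parameter are hypothetical, not Solr parameters):

```python
# Postings lists copied from the slide's table.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(terms, operator="AND"):
    """Return the doc ids matching all (AND) or any (OR) of the terms."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


print(sorted(match(["apache", "solr"], "AND")))
print(sorted(match(["apache", "solr"], "OR")))
```

The AND case yields the overlap region of the diagram, and the OR case yields everything inside either circle.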

Page 13: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

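Classic Lucene Relevancy Algorithm (now switched to BM25):

$$\mathrm{Score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)$$

Where: t = term; d = document; q = query; f = field

$$\mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$$

$$\mathrm{idf}(t) = 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big)$$

$$\mathrm{coord}(q, d) = \mathrm{numTermsInDocumentFromQuery} / \mathrm{numTermsInQuery}$$

$$\mathrm{queryNorm}(q) = 1 / \mathrm{sumOfSquaredWeights}^{1/2}$$

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$

$$\mathrm{norm}(t, d) = d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()$$

*Source: Solr in Action, chapter 3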

Page 14: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

TF * IDF

• Term Frequency: “How well does a term describe a document?”
– Measure: how often a term occurs per document

• Inverse Document Frequency: “How important is a term overall?”
– Measure: how rare the term is across all documents

*Source: Solr in Action, chapter 3
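To make the two measures concrete, here is a small sketch using the tf and idf definitions from the classic Lucene formula (square root of the term count, and 1 + log(numDocs / (docFreq + 1))). It deliberately leaves out boosts, norms, coord, and queryNorm, so it is a simplification and not the full scoring function; the function names are illustrative.

```python
import math


def tf(term_count):
    """Term frequency component: square root of occurrences in the doc."""
    return math.sqrt(term_count)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf_idf_score(query_terms, doc_term_counts, doc_freqs, num_docs):
    """Sum of tf * idf^2 over query terms present in the document."""
    score = 0.0
    for term in query_terms:
        count = doc_term_counts.get(term, 0)
        if count == 0:
            continue
        score += tf(count) * idf(num_docs, doc_freqs[term]) ** 2
    return score


# Counts taken from the five-document inverted index example.
doc_freqs = {"the": 4, "cow": 2, "cat": 1}
doc2 = {"the": 2, "cow": 1, "jumped": 1, "over": 1, "moon": 1}

print(tf_idf_score(["cow"], doc2, doc_freqs, num_docs=5))
print(tf_idf_score(["the"], doc2, doc_freqs, num_docs=5))
```

Even though "the" occurs twice in doc2 and "cow" only once, the rarer term "cow" contributes the larger score, which is the intuition behind multiplying TF by IDF.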

Page 15: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

That’s great, but what about domain-specific knowledge?

News search: popularity and freshness drive relevance
Restaurant search: geographical proximity and price range are critical
Ecommerce: likelihood of a purchase is key
Movie search: more popular titles are generally more relevant
Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

Page 16: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.

Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

Jane is a nurse educator in Boston seeking between $40K and $60K

*Example from chapter 16 of Solr in Action

Consider what you know about users

Page 17: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
    jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
  ) AND (
    (city:"Boston" AND state:"MA")^15 OR state:"MA"
  ) AND _val_:"map(salary, 40000, 60000, 10, 0)"

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K
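A query like the one for Jane could be generated from a user profile. The sketch below assembles the same boosts (25, 10, 15) and the same `map(salary, ...)` function query from a few profile fields and URL-encodes them with the Python standard library. The `build_job_query` helper and its parameters are hypothetical; only the field names and query structure come from the slide.

```python
from urllib.parse import parse_qs, urlencode


def build_job_query(title, city, state, min_salary, max_salary):
    """Build URL-encoded Solr params that boost title, location, and salary."""
    q = (
        f'(jobtitle:"{title}"^25 OR jobtitle:({title})^10) '
        f'AND ((city:"{city}" AND state:"{state}")^15 OR state:"{state}") '
        # map(x, min, max, target, default): 10 inside the range, 0 outside.
        f'AND _val_:"map(salary,{min_salary},{max_salary},10,0)"'
    )
    return urlencode({"fl": "jobtitle,city,state,salary", "q": q})


query_string = build_job_query("nurse educator", "Boston", "MA", 40000, 60000)
print("http://localhost:8983/solr/jobs/select/?" + query_string)

# Decoding the string again shows the query Solr would receive.
print(parse_qs(query_string)["q"][0])
```

Keeping the boosts in one function makes it easier to tune them later, for example from experiment results rather than by hand.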

Page 18: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":" Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
    {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
    {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359},
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action

Page 19: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

For full coverage of building a recommendation engine in Solr…

See Trey’s talk from Lucene Revolution 2012 (Boston):

Page 20: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Traditional Keyword Search

Recommendations

Semantic Search

User Intent

Personalized Search

Augmented Search

Domain-aware Matching

Page 21: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Taxonomies / Entity Extraction

Identifying the important entities within your domain

Page 23: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Building a Taxonomy of Entities

Many ways to generate this:

• Topic Modelling

• Clustering of documents

• Statistical Analysis of interesting phrases
- Word2Vec / Dice Conceptual Search

• Buy a dictionary (often doesn’t work for domain-specific search problems)

• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*

* K. AlJadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

Page 24: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Differentiating related terms

Synonyms:
cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse

Ambiguous Terms:
driver => driver (trucking) ~80% likelihood
driver => driver (software) ~20% likelihood

Related Terms:
r.n. => nursing, bsn
hadoop => mapreduce, hive, pig

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.

Page 28: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Query Intent

Understanding the meaning of documents and queries

Page 29: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

query parsing

Page 31: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Probabilistic Query Parser

Goal: given a query, predict which combinations of keywords should be combined together as phrases

Example: senior java developer hadoop

Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
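The candidate parsings can be enumerated mechanically: each gap between two adjacent keywords either breaks a phrase or does not. The sketch below only generates the candidates; how the probabilistic parser scores and picks among them is not shown on the slide, so no scoring is attempted here. `possible_parsings` is an illustrative name.

```python
from itertools import product


def possible_parsings(query):
    """Enumerate every way to split a query into contiguous phrases."""
    words = query.split()
    if not words:
        return []
    parsings = []
    # One boolean per gap between adjacent words: True means "break here".
    for breaks in product([True, False], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(" ".join(current))
        parsings.append(phrases)
    return parsings


for parsing in possible_parsings("senior java developer hadoop"):
    print(parsing)
```

A four-word query has three gaps and therefore 2^3 = 8 contiguous segmentations; the slide lists seven of them. The count doubles with every extra keyword, which is one reason to rank candidates statistically instead of trying them all against the index.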

Page 32: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Input: senior hadoop developer java ruby on rails perl

Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.

Page 33: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Semantic Search Architecture – Query Parsing

Identification of phrases in queries using two steps:

1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*

2) Also invoke a statistical phrase identifier to dynamically identify unknown phrases using statistics from a corpus of data (language model)

*K. AlJadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
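Step 1 can be pictured as a greedy longest-match lookup against the dictionary of known phrases. The sketch below is a plain-Python stand-in for that idea and does not use the SolrTextTagger API; the dictionary contents and the `max_len` limit are assumptions chosen for illustration.

```python
def tag_known_phrases(query, known_phrases, max_len=4):
    """Greedily group query words into the longest known phrases."""
    words = query.lower().split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, falling back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in known_phrases:
                result.append(candidate)
                i += n
                break
    return result


# Hypothetical dictionary, e.g. mined from query logs.
known = {"hadoop developer", "ruby on rails"}

print(tag_known_phrases("senior hadoop developer java ruby on rails perl", known))
```

Words that are not covered by any dictionary entry fall through as single terms, which is where the statistical phrase identifier from step 2 would get its chance.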

Page 34: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

query augmentation

Page 36: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Knowledge Graph

Documents:

id: 1
job_title: Software Engineer
desc: software engineer at a great company
skills: .Net, C#, java

id: 2
job_title: Registered Nurse
desc: a registered nurse at hospital doing hard work
skills: oncology, phlebotomy

id: 3
job_title: Java Developer
desc: a software engineer or a java engineer doing work
skills: java, scala, hibernate

Terms-Docs Inverted Index:

field       term            doc   pos
desc        a               1     4
desc        a               2     1
desc        a               3     1, 5
desc        at              1     3
desc        at              2     4
desc        company         1     6
desc        doing           2     6
desc        doing           3     8
desc        engineer        1     2
desc        engineer        3     3, 7
desc        great           1     5
desc        hard            2     7
desc        hospital        2     5
desc        java            3     6
desc        nurse           2     3
desc        or              3     4
desc        registered      2     2
desc        software        1     1
desc        software        3     2
desc        work            2     10
desc        work            3     9
job_title   java developer  3     1
…           …               …     …

Docs-Terms Uninverted Index:

field       doc   terms
desc        1     a, at, company, engineer, great, software
desc        2     a, at, doing, hard, hospital, nurse, registered, work
desc        3     a, doing, engineer, java, or, software, work
job_title   1     Software Engineer
…           …     …

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Page 37: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

How the Graph Traversal Works

[Figure: three views of the same data (Set-theory View, Graph View, Data Structure View). Skill nodes (Java, Scala, Hibernate, Oncology) are linked to documents (doc 1 to doc 6), and skills are connected to each other by has_related_skill edges. The recoverable labels group the documents as Java: docs 1, 2, 6; Scala and Hibernate: docs 3, 4; Oncology: doc 5.]
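A single hop of this traversal can be sketched with two lookups: term to documents (inverted index), then documents back to their other terms (uninverted index). The data below reuses the skills of the three example job documents from the earlier Knowledge Graph slide; the function names are illustrative, and the real Semantic Knowledge Graph scores the edges statistically instead of returning raw co-occurrence counts.

```python
from collections import Counter, defaultdict

# Skills field of the three example job documents.
docs = {
    1: {"skills": [".net", "c#", "java"]},
    2: {"skills": ["oncology", "phlebotomy"]},
    3: {"skills": ["java", "scala", "hibernate"]},
}


def build_indexes(docs, field):
    """Return (term -> doc ids, doc id -> terms) for one field."""
    inverted = defaultdict(set)
    uninverted = {}
    for doc_id, doc in docs.items():
        terms = doc.get(field, [])
        uninverted[doc_id] = set(terms)
        for term in terms:
            inverted[term].add(doc_id)
    return inverted, uninverted


def related_terms(term, inverted, uninverted):
    """One hop: term -> matching docs -> the other terms in those docs."""
    counts = Counter()
    for doc_id in inverted.get(term, ()):
        for other in uninverted[doc_id]:
            if other != term:
                counts[other] += 1
    return counts


inverted, uninverted = build_indexes(docs, "skills")
print(related_terms("java", inverted, uninverted))
```

No edge is ever stored explicitly: the has_related_skill relationship is materialized at query time from the shared documents, which is what lets the graph stay compact and always reflect the current index.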

Page 38: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Multi-level Traversal

[Figure: Graph View and Data Structure View of a multi-level traversal. Skill nodes (Java, Scala, Hibernate, Oncology) connect through documents (doc 1 to doc 6) to job title nodes (Software Engineer, Data Scientist, Java Developer). Each hop alternates an Inverted Index Lookup (term to documents) with a Doc Values Index Lookup (document to terms). Edge types shown: has_related_skill and has_related_job_title.]

Page 39: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Scoring nodes in the Graph

Foreground vs. Background Analysis

Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

$$z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot \big(1 - \mathrm{probBG}(x)\big)}}$$

Foreground Query: "Hadoop"

{ "type":"keywords",
  "values":[
    { "value":"hive", "relatedness": 0.9765, "popularity":369 },
    { "value":"spark", "relatedness": 0.9634, "popularity":15653 },
    { "value":".net", "relatedness": 0.5417, "popularity":17683 },
    { "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
    { "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
    { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }
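The z formula on this slide translates directly into code. The sketch below computes the raw z-score only; the relatedness values in the JSON response fall between -1 and 1, and the normalization that produces them is not given on the slide, so it is not reproduced here. The example counts are made up for illustration.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """z-score of a term's foreground count versus its background rate.

    count_fg:      docs in the foreground set containing the term
    total_docs_fg: total docs in the foreground set
    prob_bg:       probability of the term in the background set
    """
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        # Term never (or always) appears in the background; no signal.
        return 0.0
    return (count_fg - expected) / std_dev


# Hypothetical counts: 100 foreground docs, term appears in 10% of all docs.
print(z_score(50, 100, 0.1))  # far more common than expected: positive
print(z_score(10, 100, 0.1))  # exactly as common as expected: zero
print(z_score(1, 100, 0.1))   # rarer than expected: negative
```

The sign of the score matches the pattern in the example response: terms over-represented in the foreground score positive, and terms under-represented score negative.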

Page 40: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.

Knowledge Graph

Multi-level Graph Traversal with Scores

[Figure: traversal from a starting node, "software engineer*" (materialized node), across has_related_skill edges to skill nodes and across has_related_job_title edges to job title nodes. Node labels shown: Java, C#, .NET, Hibernate, Scala, VB.NET, .NET Developer, Java Developer, Software Engineer, Data Scientist. Each edge carries a relatedness score; the scores shown range from 0.34 to 0.93.]

Page 41: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Knowledge Graph

Page 42: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Knowledge Graph

Page 43: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

contextual disambiguation

Page 44: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Two methodologies:

1) Query Log Mining
2) Semantic Knowledge Graph

Knowledge Graph

Page 45: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

How do we handle phrases with ambiguous meanings?

Example     Related Keywords (representing multiple meanings)

driver      truck driver, linux, windows, courier, embedded, cdl, delivery

architect   autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer

…           …

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.

Page 46: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Query Log Mining: Discovering ambiguous phrases

1) Classify users who ran each search in the search logs (i.e. by the job title classifications of the jobs to which they applied)

2) Create a probabilistic graphical model of those classifications mapped to each keyword phrase.

3) Segment the search term => related search terms list by classification, to return a separate related terms list per classification

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.

Page 47: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Semantic Knowledge Graph: Discovering ambiguous phrases

1) Exact same concept, but use a document classification field (e.g. category) as the first level of your graph, and the related terms as the second level to which you traverse.

2) Has the benefit that you don’t need query logs to mine. However, the result will be representative of your data, as opposed to your users’ intent, so the quality depends on how clean and representative your documents are.

Page 48: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Disambiguated meanings (represented as term vectors)

Example     Related Keywords (Disambiguated Meanings)

architect   1: enterprise architect, java architect, data architect, oracle, java, .net
            2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer

driver      1: linux, windows, embedded
            2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier

designer    1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
            2: graphic, web designer, design, web design, graphic design, graphic designer
            3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit

…           …

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.

Page 49: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Using the disambiguated meanings

In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning?

1. Any pre-existing knowledge about the user:
• User is a software engineer
• User has previously run searches for “c++” and “linux”

2. Context within the query:
User searched for windows AND driver vs. courier OR driver

3. If all else fails (and there is no context), use the most commonly occurring meaning.

driver
1: linux, windows, embedded
2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
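The three rules above can be combined into one small selection function: count how many context terms (from the user's history or from the rest of the query) overlap each meaning's term vector, and fall back to the most common meaning when nothing overlaps. This is a sketch under stated assumptions: the term vectors come from the slide, the priors reuse the approximate 80%/20% likelihoods quoted earlier in the deck for "driver", and the sense labels and `pick_sense` helper are hypothetical.

```python
SENSES = {
    "driver": [
        {"label": "software", "prior": 0.2,
         "terms": {"linux", "windows", "embedded"}},
        {"label": "trucking", "prior": 0.8,
         "terms": {"truck driver", "cdl driver", "delivery driver",
                   "class b driver", "cdl", "courier"}},
    ],
}


def pick_sense(term, context_terms):
    """Pick the sense with the most context overlap; ties go to the prior."""
    candidates = SENSES.get(term)
    if not candidates:
        return None
    context = {t.lower() for t in context_terms}
    best = max(
        candidates,
        key=lambda sense: (len(sense["terms"] & context), sense["prior"]),
    )
    return best["label"]


print(pick_sense("driver", ["windows"]))       # context within the query
print(pick_sense("driver", ["c++", "linux"]))  # knowledge about the user
print(pick_sense("driver", []))                # no context: most common sense
```

Sorting on the (overlap, prior) pair means the prior only matters when the context gives no evidence either way, which mirrors the "if all else fails" rule.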

Page 51: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Relevancy Tuning

Improving ranking algorithms through experiments and models

Page 52: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

How to Measure Relevancy?

[Figure: Venn diagram of two overlapping sets. A = Retrieved Documents, C = Related (relevant) Documents, B = their overlap.]

Precision = B/A

Recall = B/C

Problem:

Assume Precision = 90% and Recall = 100%, but the 10% of irrelevant documents were ranked at the top of the retrieved documents. Is that OK?
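The problem raised here is easy to reproduce: precision and recall are set-based, so they cannot see ranking order. A minimal sketch with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall; ranking order is ignored."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


relevant = [f"d{i}" for i in range(1, 10)]  # nine relevant documents

# Same ten results, with the one irrelevant document last or first.
good_order = relevant + ["junk"]
bad_order = ["junk"] + relevant

print(precision_recall(good_order, relevant))
print(precision_recall(bad_order, relevant))
```

Both orderings score 90% precision and 100% recall, even though a user would clearly prefer the first. That gap is what position-aware metrics are meant to close.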

Page 53: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Normalized Discounted Cumulative Gain

Given ranking:

Rank    Relevancy
3       0.95
1       0.70
2       0.60
4       0.45

Ideal ranking:

Rank    Relevancy
1       0.95
2       0.85
3       0.80
4       0.65

• Position is considered in quantifying relevancy.

• Labeled dataset is required.
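For reference, a common formulation of nDCG divides the discounted cumulative gain of the given ranking by that of the ideal ordering of the same labels, with a log2 position discount. The sketch below uses that formulation (other gain and discount variants exist) and feeds it the relevancy labels of the given ranking from this slide, read in rank order 1 to 4.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: later positions count for less."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevancy labels of the given ranking, in rank order 1, 2, 3, 4.
given = [0.70, 0.60, 0.95, 0.45]

print(ndcg(given))
print(ndcg(sorted(given, reverse=True)))
```

The given ranking scores below 1 because its most relevant document sits at rank 3; reordering the same documents by label yields exactly 1. The labels themselves still have to come from somewhere, which is the "labeled dataset is required" point.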

Page 54: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

learning to rank

Page 55: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Learning to Rank (LTR)

● It applies machine learning techniques to discover the combination of features that provides the best ranking.

● It requires a labeled set of documents with relevancy scores for a given set of queries.

● Features used for ranking are usually more computationally expensive than the ones used for matching.

● It typically re-ranks a subset of the matched documents (e.g. top 1000).
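The re-ranking step in the last bullet can be sketched as follows: only the top N matched documents are rescored with the (more expensive) model, and the rest keep their original order. The linear model, feature names, and weights below are assumptions for illustration; a trained LTR model would supply its own features and scoring function, and this is not the Solr LTR plugin's API.

```python
def rerank(matched_docs, weights, top_n=1000):
    """Re-rank the first top_n docs by a linear model over their features."""
    head, tail = matched_docs[:top_n], matched_docs[top_n:]

    def model_score(doc):
        return sum(
            weights.get(name, 0.0) * value
            for name, value in doc["features"].items()
        )

    # Only the head is rescored; the tail keeps its original order.
    return sorted(head, key=model_score, reverse=True) + tail


# Hypothetical documents, already ordered by the first-pass match score.
matched = [
    {"id": "a", "features": {"bm25": 1.0, "clicks": 0.0}},
    {"id": "b", "features": {"bm25": 0.5, "clicks": 3.0}},
    {"id": "c", "features": {"bm25": 0.1, "clicks": 9.0}},
]
weights = {"bm25": 1.0, "clicks": 0.5}

print([d["id"] for d in rerank(matched, weights, top_n=2)])
print([d["id"] for d in rerank(matched, weights)])
```

Limiting the expensive model to the head of the result list keeps query latency bounded while still letting the learned model decide what users see first.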

Page 57: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Common LTR Algorithms

• RankNet* (Neural Network, boosted trees)

• LambdaMART* (set of regression trees)

• SVM Rank** (SVM classifier)

* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf

** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf

Page 58: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

LambdaMart Example

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016

Page 59: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Obtaining Relevancy Judgements

• Typical Methodologies

1) Hire employees, contractors, or interns
- Pros: Accuracy
- Cons: Expensive; not scalable (cost or manpower-wise); data becomes stale

2) Crowdsource
- Pros: Less cost, more scalable
- Cons: Less accurate; data still becomes stale

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016

Page 60: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Reflected Intelligence: Possible to infer relevancy judgements?

Ranked results for a query:

Rank    Document ID
1       Doc1
2       Doc2
3       Doc3
4       Doc4

Click Graph (Query to document):

Doc1    Doc2    Doc3
0       1       1

Skip Graph (Query to document):

Doc1    Doc2    Doc3
1       0       0

Source: T. Grainger, K. AlJadda. “Reflected Intelligence: Evolving self-learning data systems”. Georgia Tech, 2016
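One simple way to turn click logs into judgements along these lines: a document counts as clicked if the user clicked it, as skipped if it was ranked above the last click but not clicked, and as unseen otherwise. The ratio clicks / (clicks + skips) then serves as an inferred relevancy score. This heuristic and the log format are assumptions for illustration, not necessarily the method behind the slide.

```python
from collections import defaultdict


def infer_judgements(log):
    """log: iterable of (query, ranked_doc_ids, clicked_doc_ids)."""
    clicks = defaultdict(int)
    skips = defaultdict(int)
    for query, ranked, clicked in log:
        clicked = set(clicked)
        positions = [i for i, doc in enumerate(ranked) if doc in clicked]
        if not positions:
            continue  # no usable click signal in this impression
        last_click = max(positions)
        # Docs below the last click are treated as unseen and ignored.
        for doc in ranked[:last_click + 1]:
            if doc in clicked:
                clicks[(query, doc)] += 1
            else:
                skips[(query, doc)] += 1
    judgements = {}
    for key in set(clicks) | set(skips):
        judgements[key] = clicks[key] / (clicks[key] + skips[key])
    return judgements


# One impression matching the slide: Doc1 skipped, Doc2 and Doc3 clicked.
log = [("q", ["Doc1", "Doc2", "Doc3", "Doc4"], ["Doc2", "Doc3"])]
print(infer_judgements(log))
```

With enough traffic, these per-query scores can be refreshed continuously, which addresses the staleness and cost problems of hired or crowdsourced judges.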

Page 61: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Automated Relevancy Benchmark System (Offline)

Page 62: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Conclusion

Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)

Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)

Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)

Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)

Self-learning

Page 63: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Additional References:

Page 64: Reflected Intelligence - Lucene/Solr as a self-learning data system: Presented by Trey Grainger, Lucidworks

Contact Info

Trey Grainger

@treygrainger

http://solrinaction.com
Conference discount (39% off): ctwlucsoltw

Other presentations: http://www.treygrainger.com