bm25 is so yesterday - events.static.linuxfound.org is so yesterday ... relevance in solr grant...

BM25issoYesterdayModernTechniquesforBetterSearch

RelevanceinSolrGrantIngersollCTOLucidworks

Lucene/Solr/MahoutCommitter

😊

iPad case

😊

iPad case

🤓

"ipad accessory"~3 OR "ipad case"~5

1.

15.

So,whatdoyoudo?

if(doc.name.contains(“Vikings”)){doc.boost=100

}

OR

q:(MAINQUERY)OR(name:Vikings)^Y

IndexTime:

QueryTime:

• TermFrequency:“Howwellatermdescribesadocument?”• Measure:howoftenatermoccursperdocument

• InverseDocumentFrequency:“Howimportantisatermoverall?”• Measure:howrarethetermisacrossalldocuments

TF*IDF

Score(q,d)=∑idf(t)·(tf(tind)·(k+1))/(tf(tind)+k·(1–b+b·|d|/avgdl)tinq

Where:t=term;d=document;q=query;i=indextf(tind)=numTermOccurrencesInDocument½idf(t)=1+log(numDocs/(docFreq+1))|d|=∑1tindavgdl==(∑|d|)/(∑1))dinidinik=Freeparameter.Usually~1.2to2.0.Increasestermfrequencysaturationpoint.b=Freeparameter.Usually~0.75.Increasesimpactofdocumentnormalization.

BM25 (aka Okapi)

Lather,Rinse,Repeat

• Captureandlogprettymucheverything• Searches,Timeonpage/1stclick,Whatwasnotchosen,etc.

• Precision—Ofthoseshown,what’srelevant?• Recall—Ofallthat’srelevant,whatwasfound?• NDCG—Accountforposition

Measure, Measure, Measure

Magic

Guessing

CoreInformationTheory(akaLucene/Solr)

SearchAids(Facets,DidYouMean,Highlighting)

MachineLearning(Clicks,Recs,Personalization,Userfeedback)

Rules,DomainSpecificKnowledge

fuhgeddaboudit

Content Collaboration Context

Core Solr capabilities: text matching, faceting, spell checking, highlighting

Business Rules for content: landing pages, boost/block, promotions, etc.

Leverage collective intelligence to predict what users will do based on historical,

aggregated data

Recommenders, Popularity, Search Paths

Who are you? Where are you? What have you done previously?

User/Market Segmentation, Roles, Security, Personalization

Next Genera/on Relevance

But What About the Real World? Indexing Edition

NER,TopicDetection,Clustering

Word2Vec,etc.

DomainRules:Synonyms,Regexes,LexicalResources

Extraction

LoadIntoSparkBuildW2V,

PageRank,Topic,ClusteringModels

Offline

Content

Models

But What About the Real World? Query Edition

QueryIntentStrategic,Tactical,

Semantic😊

iPad case

Head/Tail/Clickstreamenhancement

UserFactors:Segmentation,Location,History,Profile,Security

Parse

DomainSpecificRulesTransformResults

…

CascadingRerankersLearnToRank(multi-

model),Biascorrections

But What About the Real World? Signals Edition

LoadIntoSpark ClickstreamModelsSignals

QueryAnalysisJobsRecommenders/Personalization

😊

iPad case

QueryEdition

Raw

Models

(Exact/OriginalMatch)^X(SloppyPhrase)~M^Y

(ANDQ)^Z(ORQ)^XX

(Expansions/Click/Head/TailBoosts)^YY(PersonalizationBiases)^ZZ

({!ltr…})

Filters+Options:security,rules,hardpreferences,categories

The Perfect(?!?) Query* YMMV!

}Precision

Recall

CaveatEmptor!

*Note:therearealotofvariationsonthis.edismaxhandlesmost

LearntoRank

X>Y>Z>XXAllweightscanbelearned

• Don’ttakemywordforit,experiment!• Goodprimer:

• http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics

• Rulesarefine,aslongasthearecontained,havealifespanandaremeasuredforeffectiveness

Experimentation, Not Editorialization

http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics

http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics

ShowUsAlready,WillYou!

• But Wait, There’s More!

Fusion Architecture

SECURITY BUILT-IN

Shards Shards

Apache Solr

Apache Zookeeper

ZK 1

Leader Elec*on Load Balancing

ZK N

Shared Config Management

Worker Worker

Apache SparkCluster

Manager

REST

API

Admin UI

Twigkit

LOGS FILE WEB DATABASE CLOUD

HD

FS (O

p*on

al)

Core Services

Connectors

• • •

ETL and Query Pipelines

Recommenders/Signals/Rules

NLP

Machine Learning

AlerEng and Messaging

Security

Scheduling

Key Features

Shards Shards

Apache Solr

Worker Worker

Apache SparkCluster

Manager

• Solr:• ExtensiveTextRankingFeatures

• SimilarityModels• FunctionQueries• Boost/Block

• PluggableReranker• LearntoRankcontrib• Multi-tenant

• Spark• SparkML(RandomForests,Regression,etc.)• Largescale,distributedcompute

Demo Details

• Best Buy Kaggle Competition Data Set

- Product Catalog: ~1.3M

- Signals: 1 month of query, document logs

• Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib

• Twigkit UI (http://twigkit.com)

Demo Details

http://twigkit.com

• http://lucidworks.com• http://lucene.apache.org/solr• http://spark.apache.org/• https://github.com/lucidworks/spark-solr• https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank

• BloombergtalkonLTRhttps://www.youtube.com/watch?v=M7BKwJoh96s

Resources

http://lucidworks.com

http://lucene.apache.org/solr

http://spark.apache.org/

https://github.com/lucidworks/spark-solr

https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank

https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank

https://www.youtube.com/watch?v=M7BKwJoh96s

https://www.youtube.com/watch?v=M7BKwJoh96s

bm25 is so yesterday - events.static.linuxfound.org is so yesterday ... relevance in solr grant...

Documents