bm25 is so yesterday - events.static.linuxfound.org is so yesterday ... relevance in solr grant...

30

Upload: phamnga

Post on 16-Jul-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

BM25issoYesterdayModernTechniquesforBetterSearch

RelevanceinSolrGrantIngersollCTOLucidworks

Lucene/Solr/MahoutCommitter

šŸ˜Š

iPad case

šŸ˜Š

iPad case

šŸ¤“

"ipad accessory"~3 OR "ipad case"~5

1.

15.

šŸ‘Ž

So,whatdoyoudo?

if(doc.name.contains(ā€œVikingsā€)){doc.boost=100

}

OR

q:(MAINQUERY)OR(name:Vikings)^Y

IndexTime:

QueryTime:

ā€¢ TermFrequency:ā€œHowwellatermdescribesadocument?ā€ā€¢ Measure:howoftenatermoccursperdocument

ā€¢ InverseDocumentFrequency:ā€œHowimportantisatermoverall?ā€ā€¢ Measure:howrarethetermisacrossalldocuments

TF*IDF

Score(q,d)=āˆ‘idf(t)Ā·(tf(tind)Ā·(k+1))/(tf(tind)+kĀ·(1ā€“b+bĀ·|d|/avgdl)tinq

Where:t=term;d=document;q=query;i=indextf(tind)=numTermOccurrencesInDocumentĀ½idf(t)=1+log(numDocs/(docFreq+1))|d|=āˆ‘1tindavgdl==(āˆ‘|d|)/(āˆ‘1))dinidinik=Freeparameter.Usually~1.2to2.0.Increasestermfrequencysaturationpoint.b=Freeparameter.Usually~0.75.Increasesimpactofdocumentnormalization.

BM25 (aka Okapi)

Lather,Rinse,Repeat

šŸ’”

WWGD?

ā€¢ Captureandlogprettymucheverythingā€¢ Searches,Timeonpage/1stclick,Whatwasnotchosen,etc.

ā€¢ Precisionā€”Ofthoseshown,whatā€™srelevant?ā€¢ Recallā€”Ofallthatā€™srelevant,whatwasfound?ā€¢ NDCGā€”Accountforposition

Measure, Measure, Measure

Magic

Guessing

CoreInformationTheory(akaLucene/Solr)

SearchAids(Facets,DidYouMean,Highlighting)

MachineLearning(Clicks,Recs,Personalization,Userfeedback)

Rules,DomainSpecificKnowledge

fuhgeddaboudit

Content Collaboration Context

Core Solr capabilities: text matching, faceting, spell checking, highlighting

Business Rules for content: landing pages, boost/block, promotions, etc.

Leverage collective intelligence to predict what users will do based on historical,

aggregated data

Recommenders, Popularity, Search Paths

Who are you? Where are you? What have you done previously?

User/Market Segmentation, Roles, Security, Personalization

Next Genera/on Relevance

But What About the Real World? Indexing Edition

NER,TopicDetection,Clustering

Word2Vec,etc.

DomainRules:Synonyms,Regexes,LexicalResources

Extraction

LoadIntoSparkBuildW2V,

PageRank,Topic,ClusteringModels

Offline

Content

Models

But What About the Real World? Query Edition

QueryIntentStrategic,Tactical,

SemanticšŸ˜Š

iPad case

Head/Tail/Clickstreamenhancement

UserFactors:Segmentation,Location,History,Profile,Security

Parse

DomainSpecificRulesTransformResults

ā€¦

CascadingRerankersLearnToRank(multi-

model),Biascorrections

But What About the Real World? Signals Edition

LoadIntoSpark ClickstreamModelsSignals

QueryAnalysisJobsRecommenders/Personalization

šŸ˜Š

iPad case

QueryEdition

Raw

Models

(Exact/OriginalMatch)^X(SloppyPhrase)~M^Y

(ANDQ)^Z(ORQ)^XX

(Expansions/Click/Head/TailBoosts)^YY(PersonalizationBiases)^ZZ

({!ltrā€¦})

Filters+Options:security,rules,hardpreferences,categories

The Perfect(?!?) Query* YMMV!

}Precision

Recall

CaveatEmptor!

*Note:therearealotofvariationsonthis.edismaxhandlesmost

LearntoRank

X>Y>Z>XXAllweightscanbelearned

ā€¢ Donā€™ttakemywordforit,experiment!ā€¢ Goodprimer:

ā€¢ http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics

ā€¢ Rulesarefine,aslongasthearecontained,havealifespanandaremeasuredforeffectiveness

Experimentation, Not Editorialization

ShowUsAlready,WillYou!

ā€¢ But Wait, Thereā€™s More!

Fusion Architecture

SECURITY BUILT-IN

Shards Shards

Apache Solr

Apache Zookeeper

ZK 1

Leader Elec*on Load Balancing

ZK N

Shared Config Management

Worker Worker

Apache SparkCluster

Manager

REST

API

Admin UI

Twigkit

LOGS FILE WEB DATABASE CLOUD

HD

FS (O

p*on

al)

Core Services

Connectors

ā€¢ ā€¢ ā€¢

ETL and Query Pipelines

Recommenders/Signals/Rules

NLP

Machine Learning

AlerEng and Messaging

Security

Scheduling

Key Features

Shards Shards

Apache Solr

Worker Worker

Apache SparkCluster

Manager

ā€¢ Solr:ā€¢ ExtensiveTextRankingFeatures

ā€¢ SimilarityModelsā€¢ FunctionQueriesā€¢ Boost/Block

ā€¢ PluggableRerankerā€¢ LearntoRankcontribā€¢ Multi-tenant

ā€¢ Sparkā€¢ SparkML(RandomForests,Regression,etc.)ā€¢ Largescale,distributedcompute

Demo Details

ā€¢ Best Buy Kaggle Competition Data Set

- Product Catalog: ~1.3M

- Signals: 1 month of query, document logs

ā€¢ Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib

ā€¢ Twigkit UI (http://twigkit.com)

Demo Details

ā€¢ http://lucidworks.comā€¢ http://lucene.apache.org/solrā€¢ http://spark.apache.org/ā€¢ https://github.com/lucidworks/spark-solrā€¢ https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank

ā€¢ BloombergtalkonLTRhttps://www.youtube.com/watch?v=M7BKwJoh96s

Resources