bm25 is so yesterday - events.static.linuxfound.org is so yesterday ... relevance in solr grant...
TRANSCRIPT
BM25issoYesterdayModernTechniquesforBetterSearch
RelevanceinSolrGrantIngersollCTOLucidworks
Lucene/Solr/MahoutCommitter
if(doc.name.contains(āVikingsā)){doc.boost=100
}
OR
q:(MAINQUERY)OR(name:Vikings)^Y
IndexTime:
QueryTime:
ā¢ TermFrequency:āHowwellatermdescribesadocument?āā¢ Measure:howoftenatermoccursperdocument
ā¢ InverseDocumentFrequency:āHowimportantisatermoverall?āā¢ Measure:howrarethetermisacrossalldocuments
TF*IDF
Score(q,d)=āidf(t)Ā·(tf(tind)Ā·(k+1))/(tf(tind)+kĀ·(1āb+bĀ·|d|/avgdl)tinq
Where:t=term;d=document;q=query;i=indextf(tind)=numTermOccurrencesInDocumentĀ½idf(t)=1+log(numDocs/(docFreq+1))|d|=ā1tindavgdl==(ā|d|)/(ā1))dinidinik=Freeparameter.Usually~1.2to2.0.Increasestermfrequencysaturationpoint.b=Freeparameter.Usually~0.75.Increasesimpactofdocumentnormalization.
BM25 (aka Okapi)
ā¢ Captureandlogprettymucheverythingā¢ Searches,Timeonpage/1stclick,Whatwasnotchosen,etc.
ā¢ PrecisionāOfthoseshown,whatāsrelevant?ā¢ RecallāOfallthatāsrelevant,whatwasfound?ā¢ NDCGāAccountforposition
Measure, Measure, Measure
Magic
Guessing
CoreInformationTheory(akaLucene/Solr)
SearchAids(Facets,DidYouMean,Highlighting)
MachineLearning(Clicks,Recs,Personalization,Userfeedback)
Rules,DomainSpecificKnowledge
fuhgeddaboudit
Content Collaboration Context
Core Solr capabilities: text matching, faceting, spell checking, highlighting
Business Rules for content: landing pages, boost/block, promotions, etc.
Leverage collective intelligence to predict what users will do based on historical,
aggregated data
Recommenders, Popularity, Search Paths
Who are you? Where are you? What have you done previously?
User/Market Segmentation, Roles, Security, Personalization
Next Genera/on Relevance
But What About the Real World? Indexing Edition
NER,TopicDetection,Clustering
Word2Vec,etc.
DomainRules:Synonyms,Regexes,LexicalResources
Extraction
LoadIntoSparkBuildW2V,
PageRank,Topic,ClusteringModels
Offline
Content
Models
But What About the Real World? Query Edition
QueryIntentStrategic,Tactical,
Semanticš
iPad case
Head/Tail/Clickstreamenhancement
UserFactors:Segmentation,Location,History,Profile,Security
Parse
DomainSpecificRulesTransformResults
ā¦
CascadingRerankersLearnToRank(multi-
model),Biascorrections
But What About the Real World? Signals Edition
LoadIntoSpark ClickstreamModelsSignals
QueryAnalysisJobsRecommenders/Personalization
š
iPad case
QueryEdition
Raw
Models
(Exact/OriginalMatch)^X(SloppyPhrase)~M^Y
(ANDQ)^Z(ORQ)^XX
(Expansions/Click/Head/TailBoosts)^YY(PersonalizationBiases)^ZZ
({!ltrā¦})
Filters+Options:security,rules,hardpreferences,categories
The Perfect(?!?) Query* YMMV!
}Precision
Recall
CaveatEmptor!
*Note:therearealotofvariationsonthis.edismaxhandlesmost
LearntoRank
X>Y>Z>XXAllweightscanbelearned
ā¢ Donāttakemywordforit,experiment!ā¢ Goodprimer:
ā¢ http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics
ā¢ Rulesarefine,aslongasthearecontained,havealifespanandaremeasuredforeffectiveness
Experimentation, Not Editorialization
ā¢ But Wait, Thereās More!
Fusion Architecture
SECURITY BUILT-IN
Shards Shards
Apache Solr
Apache Zookeeper
ZK 1
Leader Elec*on Load Balancing
ZK N
Shared Config Management
Worker Worker
Apache SparkCluster
Manager
REST
API
Admin UI
Twigkit
LOGS FILE WEB DATABASE CLOUD
HD
FS (O
p*on
al)
Core Services
Connectors
ā¢ ā¢ ā¢
ETL and Query Pipelines
Recommenders/Signals/Rules
NLP
Machine Learning
AlerEng and Messaging
Security
Scheduling
Key Features
Shards Shards
Apache Solr
Worker Worker
Apache SparkCluster
Manager
ā¢ Solr:ā¢ ExtensiveTextRankingFeatures
ā¢ SimilarityModelsā¢ FunctionQueriesā¢ Boost/Block
ā¢ PluggableRerankerā¢ LearntoRankcontribā¢ Multi-tenant
ā¢ Sparkā¢ SparkML(RandomForests,Regression,etc.)ā¢ Largescale,distributedcompute
Demo Details
ā¢ Best Buy Kaggle Competition Data Set
- Product Catalog: ~1.3M
- Signals: 1 month of query, document logs
ā¢ Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib
ā¢ Twigkit UI (http://twigkit.com)
Demo Details
ā¢ http://lucidworks.comā¢ http://lucene.apache.org/solrā¢ http://spark.apache.org/ā¢ https://github.com/lucidworks/spark-solrā¢ https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank
ā¢ BloombergtalkonLTRhttps://www.youtube.com/watch?v=M7BKwJoh96s
Resources