Improving relevance with log information
TRANSCRIPT
Sources of ranking information
Document text
Term-frequency-based weights
Vector space models, cosine similarity
BM25, BM25F
Based purely on the document and the query
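As a concrete illustration of term-frequency-based weighting, here is a minimal sketch of the standard BM25 formula for a single query term. The parameter defaults (k1=1.2, b=0.75) are common choices, not values from the talk.

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Score one query term against one document with BM25.

    tf: term frequency in the document
    df: number of documents containing the term
    doc_len / avg_doc_len: document length and corpus average
    n_docs: total number of documents in the corpus
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A document's score for a whole query is the sum of bm25_score over
# the query's terms; BM25F extends this with per-field weighting.
```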
Link analysis
Google – PageRank
Citation analysis
Historical behaviour
Which results were picked for this query before
How well did similar results perform in the past?
How to analyse historical behaviour
Finding past behaviour
Keep logs of searches, together with their results.
Keep click-through information.
Keep track of eventual outcomes (sales, ad views, content downloads).
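One way to structure such logs (field names here are illustrative, not from the talk): give each search an id, and have click and outcome records reference it so the three streams can be joined later.

```python
import json
import time
import uuid

def log_search(query, result_ids):
    """Record a search with the results it returned; carries a fresh search id."""
    return {"event": "search", "search_id": str(uuid.uuid4()),
            "ts": time.time(), "query": query, "results": result_ids}

def log_click(search_id, doc_id, position):
    """Record a click-through, tied back to the originating search."""
    return {"event": "click", "search_id": search_id,
            "ts": time.time(), "doc": doc_id, "position": position}

def log_outcome(search_id, doc_id, outcome):
    """Record an eventual outcome (sale, ad view, download, ...)."""
    return {"event": "outcome", "search_id": search_id,
            "ts": time.time(), "doc": doc_id, "outcome": outcome}

# Each record would typically be appended to a log as one JSON line:
#   logfile.write(json.dumps(entry) + "\n")
```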
Hadoop
Distributed data processing
Map-combine-reduce
Very good for log analysis!
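The map/reduce pattern can be sketched in plain Python, without Hadoop, for a typical log-analysis job: counting clicks per (query, document) pair. The single-machine "shuffle" here is just a sort plus groupby; the tab-separated log format is an assumption for illustration.

```python
from itertools import groupby
from operator import itemgetter

def mapper(log_line):
    """Emit ((query, doc), 1) for each click record."""
    event, query, doc = log_line.split("\t")
    if event == "click":
        yield (query, doc), 1

def reducer(key, values):
    """Sum the click counts for one (query, doc) pair."""
    yield key, sum(values)

def run(lines):
    """Simulate map -> shuffle/sort -> reduce on one machine."""
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=itemgetter(0))          # the "shuffle" phase
    out = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for k, v in reducer(key, (v for _, v in group)):
            out[k] = v
    return out
```

On a real cluster the mapper and reducer run in parallel over log shards, which is what makes this pattern so well suited to large search logs.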
Dumbo
Python interface for writing Hadoop jobs.
Very simple to use.
Very poor documentation, sadly.
Some performance penalty for using Python, but very good for ad-hoc jobs and rapid development.
Past results
Easy to track results which were picked, but:
New results have never been picked
New queries have never had results picked
Need massive query volume to get anywhere
Past behaviour
Use the history better by building models
Represent documents in terms of features.
Use history to produce a score for each result.
Use machine learning to build a model to predict the score for a set of features.
Use model to produce scores for ranking.
Features
BM25 scores for each field
Review scores
Categories
Prices
Price within a category (computed with Dumbo)
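A sketch of how these features might come together into one vector per (query, document) pair. The field names and structure are hypothetical; the per-category mean price is assumed to be precomputed, e.g. by a Hadoop/Dumbo job over the catalogue.

```python
def features(doc, query_bm25, category_mean_price):
    """Build a feature vector for one (query, document) pair.

    doc: document attributes (review score, price, ...)
    query_bm25: per-field BM25 scores for this query against the doc
    category_mean_price: mean price in the doc's category (precomputed)
    """
    return [
        query_bm25["title"],                  # BM25 score, title field
        query_bm25["body"],                   # BM25 score, body field
        doc["review_score"],
        doc["price"],
        doc["price"] / category_mean_price,   # price relative to category
    ]
```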
Scores
Account for position bias
Model click-throughs for each position
“An Experimental Comparison of Click Position-Bias Models” - Craswell et al.
Account for old data being less relevant
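One illustrative way to combine both corrections, assuming a simple examination-style model in the spirit of Craswell et al.: divide observed clicks by the baseline click-through rate of the position they were shown at, and exponentially down-weight old events. The half-life value and the structure of the score are assumptions, not the talk's exact method.

```python
def score_clicks(events, position_ctr, now, half_life_days=30.0):
    """Position- and age-corrected click score for one result.

    events: (timestamp, position, clicked) tuples for this result
    position_ctr: baseline click-through rate per position, estimated
    from the logs (captures position bias: top slots attract clicks)
    """
    num = den = 0.0
    for ts, pos, clicked in events:
        age_days = (now - ts) / 86400.0
        w = 0.5 ** (age_days / half_life_days)   # old data counts less
        num += w * (1.0 if clicked else 0.0)
        den += w * position_ctr[pos]             # expected clicks at this slot
    return num / den if den else 0.0
```

A score above 1.0 means the result was clicked more often than its display positions would predict.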
Building a model
Logistic regression
LIBLINEAR / LIBSVM
Apache Mahout
Neural nets
libfann
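To make the modelling step concrete, here is a minimal logistic regression trained by plain gradient descent on log loss. In practice you would reach for LIBLINEAR or Mahout as the slides suggest; this self-contained version just shows the idea.

```python
import math

def train_logistic(data, epochs=200, lr=0.1):
    """Fit w, b for P(relevant | x) = sigmoid(w.x + b) by gradient descent.

    data: list of (feature_vector, label) pairs with label in {0, 1}
    """
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = predict(w, b, x)
            g = p - y                      # gradient of log loss wrt the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Predicted probability of relevance; used directly as a ranking score."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
```

Ranking then just sorts candidate results by their predicted score.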
Interesting results
BM25 weights for the title field should be weighted about five times higher than weights for body text.*
Don't need very much data to build a useful model.
* for some sample news data.
Summary
Keep your logs!
Tie searches to results in logs
Dumbo + Hadoop makes ad-hoc investigation of behaviour easy.