improving relevance with log information


Upload: richard-boulton

Post on 02-Jul-2015


Category:

Technology



TRANSCRIPT

Page 1: Improving relevance with log information

Improving relevance with log analysis

Richard Boulton

@rboulton

Page 2: Improving relevance with log information

Sources of ranking information

Page 3: Improving relevance with log information

Document text

Term frequency based weights.

Vector models, Cosine

BM25, BM25F

Purely based on document and query
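
As an illustration of a term-frequency weight that depends only on the document and the query, here is a minimal BM25 sketch. The parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are assumptions made for the example, not values from the talk.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF; the "1 +" keeps it positive for common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents need more occurrences.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy collection.
docs = [
    "hadoop log analysis".split(),
    "log analysis of search log data".split(),
    "neural nets".split(),
]
scores = bm25_scores(["log"], docs)
```

BM25F extends this idea by combining per-field term frequencies with per-field weights before the saturation step; that is not shown here.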

Page 4: Improving relevance with log information

Link analysis

Google: PageRank

Citation analysis
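
For readers unfamiliar with link analysis, a small power-iteration sketch of PageRank follows. The damping factor of 0.85, the iteration count, and the three-page graph are assumptions for illustration; production systems use far more elaborate variants.

```python
def pagerank(links, damping=0.85, iterations=50):
    """`links` maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Citation analysis applies the same reasoning with papers as nodes and citations as links.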

Page 5: Improving relevance with log information

Historical behaviour

Which results were picked for this query before?

How well did results similar to this one perform in the past?

Page 6: Improving relevance with log information

How to analyse historical behaviour

Page 7: Improving relevance with log information

Finding past behaviour

Keep logs of searches, together with their results.

Keep click-through information.

Keep track of eventual outcomes (sales, ad views, content downloads).
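
One possible way to keep these three kinds of record tied together is to stamp every event with a shared search identifier. The sketch below assumes JSON-lines logs; the field names (`search_id`, `results`, `outcome`) and helper functions are hypothetical, chosen only for this example.

```python
import json
import uuid
from collections import defaultdict

def search_event(query, result_ids):
    """One record per search, including the ranked results that were shown."""
    return {"type": "search", "search_id": uuid.uuid4().hex,
            "query": query, "results": list(result_ids)}

def click_event(search_id, doc_id):
    return {"type": "click", "search_id": search_id, "doc_id": doc_id}

def outcome_event(search_id, doc_id, outcome):
    return {"type": "outcome", "search_id": search_id,
            "doc_id": doc_id, "outcome": outcome}

def join_sessions(lines):
    """Group JSON log lines by search_id so clicks and outcomes meet their search."""
    sessions = defaultdict(
        lambda: {"query": None, "results": [], "clicks": [], "outcomes": []})
    for line in lines:
        event = json.loads(line)
        session = sessions[event["search_id"]]
        if event["type"] == "search":
            session["query"] = event["query"]
            session["results"] = event["results"]
        elif event["type"] == "click":
            session["clicks"].append(event["doc_id"])
        elif event["type"] == "outcome":
            session["outcomes"].append((event["doc_id"], event["outcome"]))
    return dict(sessions)

search = search_event("hadoop book", ["d7", "d2", "d9"])
log_lines = [json.dumps(e) for e in (
    search,
    click_event(search["search_id"], "d2"),
    outcome_event(search["search_id"], "d2", "sale"),
)]
sessions = join_sessions(log_lines)
```

Because the search record stores the full result list, the position of each click can be recovered later, which matters when correcting for position bias.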

Page 8: Improving relevance with log information

Hadoop

Distributed data processing

Map-combine-reduce

Very good for log analysis!

Page 9: Improving relevance with log information

Dumbo

Python interface for writing Hadoop jobs.

Very simple to use.

Very poor documentation, sadly.

Some performance penalty for using Python, but very good for ad hoc jobs and rapid development.
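
To give a feel for the map-combine-reduce shape of such a job, here is a sketch that counts clicks per (query, document) pair. The mapper and reducer are written in the generator style that Dumbo uses, as far as I recall its API (`mapper(key, value)`, `reducer(key, values)`), but the log format is hypothetical and `run_local` is a stand-in driver written for this example so that it runs without Hadoop.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # `value` is one tab-separated log line: query, clicked document id.
    query, doc_id = value.rstrip("\n").split("\t")
    yield (query, doc_id), 1

def reducer(key, values):
    # Summing is associative, so this also works as the combiner.
    yield key, sum(values)

def _group(pairs, fn):
    """Sort pairs by key and apply `fn(key, values)` to each group."""
    pairs = sorted(pairs, key=itemgetter(0))
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(fn(key, (value for _, value in group)))
    return out

def run_local(lines, mapper, reducer, combiner=None, chunk_size=2):
    """Local stand-in for a cluster: each chunk plays the role of one map task."""
    combined = []
    for start in range(0, len(lines), chunk_size):
        chunk = lines[start:start + chunk_size]
        mapped = [pair for offset, line in enumerate(chunk)
                  for pair in mapper(start + offset, line)]
        combined.extend(_group(mapped, combiner) if combiner else mapped)
    return dict(_group(combined, reducer))

lines = ["bm25\td1", "bm25\td1", "bm25\td2", "hadoop\td1"]
counts = run_local(lines, mapper, reducer, combiner=reducer)
```

With Dumbo installed, the same two functions would, to my understanding, be handed to `dumbo.run(mapper, reducer, combiner=reducer)` instead of the local driver; check the Dumbo documentation before relying on that.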

Page 10: Improving relevance with log information

Past results

Easy to track results which were picked, but:

New results were never picked

New queries never had results picked

Need massive volume to get anywhere
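
The limitation can be seen in a direct lookup table of click-through rates. This sketch assumes a log of `(query, doc_id, clicked)` tuples and a hypothetical `min_impressions` threshold; any pair that is new or rarely shown simply has no score.

```python
from collections import defaultdict

def build_ctr_table(log, min_impressions=20):
    """Click-through rate per (query, doc), kept only where there is enough data."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in log:
        shown[(query, doc_id)] += 1
        clicked[(query, doc_id)] += int(was_clicked)
    return {key: clicked[key] / n
            for key, n in shown.items() if n >= min_impressions}

def historical_score(table, query, doc_id):
    # None means "no usable history": a new result, a new query, or too few views.
    return table.get((query, doc_id))

# Hypothetical log: d1 has plenty of history, d2 was shown only three times.
log = ([("hadoop", "d1", True)] * 15
       + [("hadoop", "d1", False)] * 15
       + [("hadoop", "d2", True)] * 3)
table = build_ctr_table(log)
```

Here `d2` gets no score despite being clicked every time it was shown, and any query other than "hadoop" gets nothing at all.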

Page 11: Improving relevance with log information

Past behaviour

Use the history better by building models

Represent documents in terms of features.

Use history to produce a score for each result.

Use machine learning to build a model to predict the score for a set of features.

Use model to produce scores for ranking.
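
The last step, using a model to rank, can be sketched as follows. The feature layout and the weights in `toy_model` are made up for the example; in practice the model would be learned from the log-derived scores.

```python
def rank_results(candidates, model):
    """`candidates` maps doc id to a feature list; `model` maps features to a score."""
    return sorted(candidates,
                  key=lambda doc_id: model(candidates[doc_id]),
                  reverse=True)

def toy_model(features):
    # Hypothetical linear model over [title BM25, review score].
    weights = [0.6, 0.4]
    return sum(w * x for w, x in zip(weights, features))

candidates = {"d1": [1.2, 0.5], "d2": [0.4, 0.9], "d3": [2.0, 0.1]}
ranking = rank_results(candidates, toy_model)
```

Because the model sees only features, it can score documents and queries that have never appeared in the logs.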

Page 12: Improving relevance with log information

Features

BM25 scores for each field

Review scores

Categories

Prices

Price relative to other items in the same category (computed with Dumbo)
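
A sketch of how such a feature vector might be assembled is given below. Expressing "price within a category" as price divided by the category median is an assumption made for the example, as are the field names and the sample products.

```python
from statistics import median

def price_within_category(products):
    """Price of each product relative to the median price of its category."""
    by_category = {}
    for product in products:
        by_category.setdefault(product["category"], []).append(product["price"])
    medians = {cat: median(prices) for cat, prices in by_category.items()}
    return {p["id"]: p["price"] / medians[p["category"]] for p in products}

def feature_vector(product, bm25_by_field, relative_price, categories):
    """Flatten text scores, review score, prices and category into one list."""
    one_hot = [1.0 if product["category"] == cat else 0.0 for cat in categories]
    return [bm25_by_field["title"], bm25_by_field["body"],
            product["review"], product["price"],
            relative_price[product["id"]]] + one_hot

# Hypothetical catalogue.
products = [
    {"id": "a", "category": "books", "price": 10.0, "review": 4.5},
    {"id": "b", "category": "books", "price": 20.0, "review": 3.0},
    {"id": "c", "category": "books", "price": 30.0, "review": 4.0},
    {"id": "d", "category": "tv", "price": 400.0, "review": 4.2},
]
relative = price_within_category(products)
vector = feature_vector(products[0], {"title": 3.1, "body": 0.8},
                        relative, ["books", "tv"])
```

On a large catalogue the per-category aggregation is itself a natural map-reduce job, keyed by category.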

Page 13: Improving relevance with log information

Scores

Account for position bias

Model click-throughs for each position

“An Experimental Comparison of Click Position-Bias Models” - Craswell et al.

Account for old data being less relevant
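
One simple way to combine both corrections is to compare each document's clicks with the clicks expected at the positions where it was shown, while down-weighting old impressions. This is an examination-style normalisation chosen for brevity, not necessarily the model used in the talk or the best performer in the Craswell et al. comparison; the 30-day half-life is likewise an assumption.

```python
from collections import defaultdict

def decay_weight(age_days, half_life_days=30.0):
    """Exponential decay: an impression loses half its weight every half-life."""
    return 0.5 ** (age_days / half_life_days)

def position_ctr(impressions):
    """Average click-through rate observed at each result position."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for _doc_id, position, clicked, _age in impressions:
        shown[position] += 1
        clicks[position] += int(clicked)
    return {pos: clicks[pos] / shown[pos] for pos in shown}

def debiased_scores(impressions, half_life_days=30.0):
    """Decayed clicks divided by decayed expected clicks, per document."""
    impressions = list(impressions)
    ctr = position_ctr(impressions)
    actual = defaultdict(float)
    expected = defaultdict(float)
    for doc_id, position, clicked, age_days in impressions:
        weight = decay_weight(age_days, half_life_days)
        actual[doc_id] += weight * int(clicked)
        expected[doc_id] += weight * ctr[position]
    return {d: actual[d] / expected[d] for d in expected if expected[d] > 0}

# Hypothetical impressions: (doc_id, position, clicked, age_days).
impressions = (
    [("a", 1, True, 0)] * 2 + [("a", 1, False, 0)] * 2
    + [("c", 1, True, 0)] * 4
    + [("b", 2, True, 0)] * 2 + [("b", 2, False, 0)] * 2
    + [("d", 2, False, 0)] * 4
)
scores = debiased_scores(impressions)
```

Documents `a` and `b` have the same raw click-through rate, but `b` earned it from the less-examined second position and so receives the higher score.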

Page 14: Improving relevance with log information

Building a model

Logistic regression

LIBLINEAR / LIBSVM

Apache Mahout

Neural nets

libfann
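
To show what the logistic regression option involves, here is a hand-rolled batch gradient descent version in plain Python. It stands in for the libraries listed above purely so the example is self-contained; the learning rate, epoch count, two-feature layout and training rows are all assumptions.

```python
import math

def sigmoid(z):
    # Numerically stable for both large positive and large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict(model, x):
    """Predicted probability that a result with features `x` is picked."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hypothetical training data: [title BM25, review score] -> picked (1) or not (0).
rows = [[2.0, 0.9], [1.5, 0.8], [0.2, 0.3], [0.1, 0.1]]
labels = [1, 1, 0, 0]
model = train_logistic(rows, labels)
```

The learned weights can be read directly as the relative importance of each feature, which is how a finding such as a preferred title-to-body weighting could be extracted.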

Page 15: Improving relevance with log information

Interesting results

BM25 weights for title should be biased 5 times higher than weights for body text.*

Don't need very much data to build a useful model.

* for some sample news data.

Page 16: Improving relevance with log information

Summary

Keep your logs!

Tie searches to results in logs

Dumbo + Hadoop make ad hoc investigation of behaviour easy.