Search as Communication: Lessons from a Personal Journey
DESCRIPTION
Search as Communication: Lessons from a Personal Journey
by Daniel Tunkelang (Head of Query Understanding, LinkedIn)
Presented at Etsy's Code as Craft Series on May 21, 2013

When I tell people I spent a decade studying computer science at MIT and CMU, most assume that I focused my studies on information retrieval — after all, I’ve spent most of my professional life working on search. But that’s not how it happened. I learned about information extraction as a summer intern at IBM Research, where I worked on visual query reformulation. I learned how search engines work by building one at Endeca. It was only after I’d hacked my way through the problem for a few years that I started to catch up on the rich scholarly literature of the past few decades. As a result, I developed a point of view about search without the benefit of academic conventional wisdom. Specifically, I came to see search not so much as a ranking problem as a communication problem. In this talk, I’ll explain my communication-centric view of search, offering examples, general techniques, and open problems.

--

Daniel Tunkelang is Head of Query Understanding at LinkedIn. Educated at MIT and CMU, he has spent his career working on big data, addressing key challenges in search, data mining, user interfaces, and network analysis. He co-founded enterprise search and business intelligence pioneer Endeca, where he spent a decade as its Chief Scientist. In 2011, Endeca was acquired by Oracle for over $1B. Prior to LinkedIn, he led a team at Google working on local search quality. Daniel has authored fifteen patents, written a textbook on faceted search, and created the annual symposium on human-computer interaction and information retrieval.

TRANSCRIPT
Search as Communication: Lessons from a Personal Journey
Daniel Tunkelang Head of Query Understanding, LinkedIn
These are great textbooks on information retrieval.
Unfortunately, I never read them in school.
But I did study graphs and stuff.
I found myself developing a search engine.
And the next thing I knew, I was a search guy.
So what did I learn along the way?
Search isn't a ranking problem. It's a communication problem.
Outline
1. Lessons from Library Science
2. Adventures with Information Extraction
3. A Moment of Clarity
1. Lessons from Library Science
USER: information need → query → select from results
SYSTEM: rank using IR model (tf-idf, PageRank)
A bird's-eye view of how search engines work.
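The ranking step in the diagram above can be sketched in a few lines of Python. This is a minimal illustration of tf-idf scoring over a toy corpus, not how any production engine does it; the function name and the example documents are assumptions for the sketch.

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, docs):
    """Rank documents by a simple tf-idf score (illustrative only)."""
    n = len(docs)
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)  # term frequencies in this document
        score = sum(tf[t] * math.log(n / df[t])
                    for t in query_terms if df.get(t))
        scores.append((i, score))
    return sorted(scores, key=lambda x: -x[1])

docs = [["search", "engine", "ranking"],
        ["library", "science", "search"],
        ["graph", "theory"]]
ranked = tf_idf_rank(["search", "ranking"], docs)
print(ranked)
```

The point of the talk is that this scoring step is only one box in the diagram; the user's side of the loop matters just as much.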
Old school search: ask a librarian.
Search lives in an information-seeking context.
[Pirolli and Card, 2005]
vs.
Recognize ambiguity and ask for clarification.
Clarify, then refine.
Computers Books
Faceted search. It’s not just for e-commerce.
Give users transparency, guidance, and control.
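Facets give users that transparency and control by showing how the result set breaks down by attribute. A minimal sketch of facet counting, with a hypothetical result set and field name:

```python
from collections import Counter

def facet_counts(results, facet_field):
    """Count matching documents per value of a facet field."""
    counts = Counter()
    for doc in results:
        for value in doc.get(facet_field, []):
            counts[value] += 1
    return counts.most_common()

# Hypothetical result set with a "category" facet
results = [
    {"title": "Python in a Nutshell", "category": ["Computers", "Books"]},
    {"title": "SICP",                 "category": ["Computers", "Books"]},
    {"title": "Laptop sleeve",        "category": ["Computers", "Accessories"]},
]
print(facet_counts(results, "category"))
```

Showing these counts next to each facet value is what lets users refine ("Computers → Books") instead of guessing at a new query.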
Take-away for search engine developers:
Act like a librarian. Communicate with your user.
2. Adventures with Information Extraction
String matching is great but has limits.
for i in [1..n]:
    s ← w1 w2 … wi
    if Pc(s) > 0:
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]:
        for b in B[j]:
            s ← wj+1 wj+2 … wi
            if Pc(s) > 0:
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob × Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
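The slide's beam-search segmentation can be made runnable. This is a sketch under assumptions: the `Segmentation` class, the toy `phrases` probability table, and the beam width `k` are illustrative; a real `Pc` would come from query logs or an n-gram model.

```python
from dataclasses import dataclass

@dataclass
class Segmentation:
    segs: list   # list of segment strings
    prob: float  # product of segment probabilities

def segment(words, Pc, k=5):
    """B[i] holds the top-k segmentations of words[:i], scored by
    the product of per-segment probabilities Pc."""
    n = len(words)
    B = {0: [Segmentation([], 1.0)]}
    for i in range(1, n + 1):
        B[i] = []
        for j in range(i):
            s = " ".join(words[j:i])  # candidate segment words[j:i]
            if Pc(s) > 0:
                for b in B[j]:
                    B[i].append(Segmentation(b.segs + [s], b.prob * Pc(s)))
        B[i].sort(key=lambda a: -a.prob)  # keep the k most probable
        del B[i][k:]
    return B[n]

# Toy phrase model (assumed values for illustration)
phrases = {"new york": 0.3, "new": 0.1, "york": 0.1,
           "times": 0.2, "new york times": 0.25}
best = segment("new york times".split(), lambda s: phrases.get(s, 0.0))
```

With this toy model, the single segment "new york times" wins over "new york" + "times" because its probability exceeds the product of the two parts.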
People search for entities. Recognize them!
Named entity recognition is free, as in free beer.
Problem: they process each document separately.
Entity Detection System
Why not take advantage of corpus features?
Give your documents the right to vote!
Use a high-recall method to collect candidates.
• e.g., all title-case spans of words, other than a single word beginning a sentence.
Process each document separately.
• Each candidate is assigned an entity type, or no type at all.
If a candidate is mostly assigned a single entity type, extrapolate to all its occurrences.
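The voting idea above can be sketched directly. This is an illustrative implementation, not the speaker's production system: the majority threshold of 0.5 and the per-occurrence label data are assumptions.

```python
from collections import Counter

def vote_entity_types(per_doc_labels):
    """'Give your documents the right to vote': if most occurrences of a
    candidate receive a single entity type, extrapolate that type to all
    of its occurrences. per_doc_labels maps each candidate to the list of
    per-occurrence labels (an entity type, or None) from a per-document
    detector."""
    decided = {}
    for candidate, labels in per_doc_labels.items():
        label, n = Counter(labels).most_common(1)[0]
        if label is not None and n / len(labels) > 0.5:  # assumed threshold
            decided[candidate] = label
    return decided

# Hypothetical per-occurrence detector output across a small corpus
observed = {
    "Daniel Tunkelang": ["PERSON", "PERSON", None, "PERSON"],
    "Endeca": ["ORG", None, "ORG"],
    "May": ["DATE", None, None, "PERSON"],  # no majority type: undecided
}
print(vote_entity_types(observed))
```

The corpus-level vote recovers occurrences the per-document detector missed, which is exactly the advantage over processing each document in isolation.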
Looking for topics? Use idf, and its cousin ridf.
Inverse document frequency (idf)
• Too low? Probably a stop word.
• Too high? Could be noise.
Residual inverse document frequency (ridf)
• Predict idf using a Poisson model.
• Difference between idf and predicted idf.
“a good keyword is far from Poisson” [Church and Gale, 1995]
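Church and Gale's observation can be computed directly: under a Poisson model a term with collection frequency cf in a corpus of N documents should appear in about N·(1 − e^(−cf/N)) documents, and ridf measures the gap between the observed idf and that prediction. A sketch, with base-2 logs and assumed toy counts:

```python
import math

def ridf(df, cf, n_docs):
    """Residual IDF [Church and Gale, 1995]: observed idf minus the idf
    a Poisson model predicts from the collection frequency.
    df: documents containing the term; cf: total occurrences; n_docs: corpus size."""
    idf = math.log2(n_docs / df)
    lam = cf / n_docs                              # expected occurrences per document
    poisson_idf = -math.log2(1.0 - math.exp(-lam)) # idf a Poisson term would have
    return idf - poisson_idf

# A bursty keyword: 100 occurrences packed into 10 of 10,000 docs
print(ridf(df=10, cf=100, n_docs=10_000))   # high ridf: far from Poisson
# A stop-word-like term: 100 occurrences spread over 100 docs
print(ridf(df=100, cf=100, n_docs=10_000))  # ridf near 0
```

Good keywords are bursty: they concentrate in few documents, so their observed idf is well above the Poisson prediction.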
Terminology extraction? Try data recycling.
Obtain entities by any means necessary.
Take-away for search engine developers:
Entity detection is crucial. And it isn't that hard.
3. A Moment of Clarity
USER: information need → query → select from results
SYSTEM: rank using IR model (tf-idf, PageRank)
Let’s go back to our pigeons for a moment.
What does this process look like to the system?
vs.
And here’s what it looks like to the user.
GOOD NOT SO GOOD
But can the system tell the difference?
User experience should reflect system confidence.
vs.
http://searchengineland.com/getting-organized-paid-search-user-intent-the-search-funnel-116312
Derived from [Jansen et al, 2007].
Searches reflect a variety of information needs.
for i in [1..n]:
    s ← w1 w2 … wi
    if Pc(s) > 0:
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]:
        for b in B[j]:
            s ← wj+1 wj+2 … wi
            if Pc(s) > 0:
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob × Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
We can segment information need from the query.
We can learn from analyzing user behavior.
And we can look at our relevance scores.
Navigational vs. Exploratory
Claudia Hauff, Query Difficulty for Digital Libraries [2009]
There are many pre- and post-retrieval signals.
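One of the simplest pre-retrieval signals from the query-difficulty literature is the average idf of the query's terms: queries made of rare, specific terms tend to be easier and more focused than queries made of common terms. A sketch with hypothetical document frequencies (the corpus size and `df` table are assumptions):

```python
import math

def avg_idf(query_terms, df, n_docs):
    """A simple pre-retrieval difficulty signal: mean idf of query terms.
    High values suggest a specific, likely easier query; low values a
    broad, likely harder one."""
    idfs = [math.log(n_docs / df[t]) for t in query_terms if df.get(t)]
    return sum(idfs) / len(idfs) if idfs else 0.0

# Hypothetical document frequencies in a 1M-document corpus
df = {"the": 900_000, "jaguar": 1_200, "xk120": 40, "car": 150_000}
print(avg_idf(["jaguar", "xk120"], df, 1_000_000))  # specific: high avg idf
print(avg_idf(["the", "car"], df, 1_000_000))       # broad: low avg idf
```

Post-retrieval signals (score distributions, result-set clarity) are stronger predictors but require running the query first; a signal like this is available before retrieval, in time to adapt the experience.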
Take-away for search engine developers:
Queries vary in difficulty. Recognize and adapt.
Review
1. Lessons from Library Science
• Act like a librarian. Communicate with users.
2. Adventures with Information Extraction
• Entity detection is crucial. And isn't that hard.
3. A Moment of Clarity
• Queries vary in difficulty. Recognize and adapt.
Conclusion: Read the textbooks.
But treat search as a communication problem.
WE’RE HIRING! http://data.linkedin.com/search
Contact me: [email protected]
http://linkedin.com/in/dtunkelang