Search as Communication: Lessons from a Personal Journey
DESCRIPTION
Search as Communication: Lessons from a Personal Journey
by Daniel Tunkelang (Head of Query Understanding, LinkedIn)
Presented at Etsy's Code as Craft Series on May 21, 2013

When I tell people I spent a decade studying computer science at MIT and CMU, most assume that I focused my studies on information retrieval — after all, I’ve spent most of my professional life working on search. But that’s not how it happened. I learned about information extraction as a summer intern at IBM Research, where I worked on visual query reformulation. I learned how search engines work by building one at Endeca. It was only after I’d hacked my way through the problem for a few years that I started to catch up on the rich scholarly literature of the past few decades. As a result, I developed a point of view about search without the benefit of academic conventional wisdom. Specifically, I came to see search not so much as a ranking problem as a communication problem. In this talk, I’ll explain my communication-centric view of search, offering examples, general techniques, and open problems.

--

Daniel Tunkelang is Head of Query Understanding at LinkedIn. Educated at MIT and CMU, he has spent his career working on big data, addressing key challenges in search, data mining, user interfaces, and network analysis. He co-founded enterprise search and business intelligence pioneer Endeca, where he spent a decade as its Chief Scientist. In 2011, Endeca was acquired by Oracle for over $1B. Prior to LinkedIn, he led a team at Google working on local search quality. Daniel has authored fifteen patents, written a textbook on faceted search, and created the annual symposium on human-computer interaction and information retrieval.

TRANSCRIPT
Search as Communication: Lessons from a Personal Journey
Daniel Tunkelang Head of Query Understanding, LinkedIn
These are great textbooks on information retrieval.
Unfortunately, I never read them in school.
But I did study graphs and stuff.
I found myself developing a search engine.
And the next thing I knew, I was a search guy.
So what did I learn along the way?
Search isn't a ranking problem. It's a communication problem.
Outline
1. Lessons from Library Science
2. Adventures with Information Extraction
3. A Moment of Clarity
1. Lessons from Library Science
USER: information need → query → select from results
SYSTEM: rank using IR model (tf-idf, PageRank)
A bird's-eye view of how search engines work.
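The ranking step in the diagram above can be sketched in a few lines of Python. This is a minimal illustration of tf-idf scoring over a toy corpus, not how any production engine does it; the function name and the example documents are assumptions for the sketch.

```python
import math
from collections import Counter

def tf_idf_rank(query_terms, docs):
    """Rank documents by a simple tf-idf score (illustrative only)."""
    n = len(docs)
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)  # term frequencies in this document
        score = sum(tf[t] * math.log(n / df[t])
                    for t in query_terms if df.get(t))
        scores.append((i, score))
    return sorted(scores, key=lambda x: -x[1])

docs = [["search", "engine", "ranking"],
        ["library", "science", "search"],
        ["graph", "theory"]]
ranked = tf_idf_rank(["search", "ranking"], docs)
print(ranked)
```

The point of the talk is that this scoring step is only one box in the diagram; the user's side of the loop matters just as much.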
Old school search: ask a librarian.
Search lives in an information-seeking context.
[Pirolli and Card, 2005]
vs.
Recognize ambiguity and ask for clarification.
Clarify, then refine.
Computers Books
Faceted search. It’s not just for e-commerce.
Give users transparency, guidance, and control.
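Facets give users that transparency and control by showing how the result set breaks down by attribute. A minimal sketch of facet counting, with a hypothetical result set and field name:

```python
from collections import Counter

def facet_counts(results, facet_field):
    """Count matching documents per value of a facet field."""
    counts = Counter()
    for doc in results:
        for value in doc.get(facet_field, []):
            counts[value] += 1
    return counts.most_common()

# Hypothetical result set with a "category" facet
results = [
    {"title": "Python in a Nutshell", "category": ["Computers", "Books"]},
    {"title": "SICP",                 "category": ["Computers", "Books"]},
    {"title": "Laptop sleeve",        "category": ["Computers", "Accessories"]},
]
print(facet_counts(results, "category"))
```

Showing these counts next to each facet value is what lets users refine ("Computers → Books") instead of guessing at a new query.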
Take-away for search engine developers:
Act like a librarian. Communicate with your user.
2. Adventures with Information Extraction
String matching is great but has limits.
for i in [1..n]:
    s ← w1 w2 … wi
    if Pc(s) > 0:
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]:
        for b in B[j]:
            s ← wj+1 wj+2 … wi
            if Pc(s) > 0:
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob × Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
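The slide's beam-search segmentation can be made runnable. This is a sketch under assumptions: the `Segmentation` class, the toy `phrases` probability table, and the beam width `k` are illustrative; a real `Pc` would come from query logs or an n-gram model.

```python
from dataclasses import dataclass

@dataclass
class Segmentation:
    segs: list   # list of segment strings
    prob: float  # product of segment probabilities

def segment(words, Pc, k=5):
    """B[i] holds the top-k segmentations of words[:i], scored by
    the product of per-segment probabilities Pc."""
    n = len(words)
    B = {0: [Segmentation([], 1.0)]}
    for i in range(1, n + 1):
        B[i] = []
        for j in range(i):
            s = " ".join(words[j:i])  # candidate segment words[j:i]
            if Pc(s) > 0:
                for b in B[j]:
                    B[i].append(Segmentation(b.segs + [s], b.prob * Pc(s)))
        B[i].sort(key=lambda a: -a.prob)  # keep the k most probable
        del B[i][k:]
    return B[n]

# Toy phrase model (assumed values for illustration)
phrases = {"new york": 0.3, "new": 0.1, "york": 0.1,
           "times": 0.2, "new york times": 0.25}
best = segment("new york times".split(), lambda s: phrases.get(s, 0.0))
```

With this toy model, the single segment "new york times" wins over "new york" + "times" because its probability exceeds the product of the two parts.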
People search for entities. Recognize them!
Named entity recognition is free, as in free beer.
Problem: they process each document separately.
Entity Detection System
Why not take advantage of corpus features?
Give your documents the right to vote!
Use a high-recall method to collect candidates.
• e.g., all title-case spans of words, other than a single word beginning a sentence.
Process each document separately.
• Each candidate is assigned an entity type, or no type at all.
If a candidate is mostly assigned a single entity type, extrapolate to all its occurrences.
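The voting idea above can be sketched directly. This is an illustrative implementation, not the speaker's production system: the majority threshold of 0.5 and the per-occurrence label data are assumptions.

```python
from collections import Counter

def vote_entity_types(per_doc_labels):
    """'Give your documents the right to vote': if most occurrences of a
    candidate receive a single entity type, extrapolate that type to all
    of its occurrences. per_doc_labels maps each candidate to the list of
    per-occurrence labels (an entity type, or None) from a per-document
    detector."""
    decided = {}
    for candidate, labels in per_doc_labels.items():
        label, n = Counter(labels).most_common(1)[0]
        if label is not None and n / len(labels) > 0.5:  # assumed threshold
            decided[candidate] = label
    return decided

# Hypothetical per-occurrence detector output across a small corpus
observed = {
    "Daniel Tunkelang": ["PERSON", "PERSON", None, "PERSON"],
    "Endeca": ["ORG", None, "ORG"],
    "May": ["DATE", None, None, "PERSON"],  # no majority type: undecided
}
print(vote_entity_types(observed))
```

The corpus-level vote recovers occurrences the per-document detector missed, which is exactly the advantage over processing each document in isolation.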
Looking for topics? Use idf, and its cousin ridf.
Inverse document frequency (idf)
• Too low? Probably a stop word.
• Too high? Could be noise.
Residual inverse document frequency (ridf)
• Predict idf using a Poisson model.
• Difference between idf and predicted idf.
“a good keyword is far from Poisson” [Church and Gale, 1995]
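Church and Gale's observation can be computed directly: under a Poisson model a term with collection frequency cf in a corpus of N documents should appear in about N·(1 − e^(−cf/N)) documents, and ridf measures the gap between the observed idf and that prediction. A sketch, with base-2 logs and assumed toy counts:

```python
import math

def ridf(df, cf, n_docs):
    """Residual IDF [Church and Gale, 1995]: observed idf minus the idf
    a Poisson model predicts from the collection frequency.
    df: documents containing the term; cf: total occurrences; n_docs: corpus size."""
    idf = math.log2(n_docs / df)
    lam = cf / n_docs                              # expected occurrences per document
    poisson_idf = -math.log2(1.0 - math.exp(-lam)) # idf a Poisson term would have
    return idf - poisson_idf

# A bursty keyword: 100 occurrences packed into 10 of 10,000 docs
print(ridf(df=10, cf=100, n_docs=10_000))   # high ridf: far from Poisson
# A stop-word-like term: 100 occurrences spread over 100 docs
print(ridf(df=100, cf=100, n_docs=10_000))  # ridf near 0
```

Good keywords are bursty: they concentrate in few documents, so their observed idf is well above the Poisson prediction.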
Terminology extraction? Try data recycling.
Obtain entities by any means necessary.
Take-away for search engine developers:
Entity detection is crucial. And it isn't that hard.
3. A Moment of Clarity
USER: information need → query → select from results
SYSTEM: rank using IR model (tf-idf, PageRank)
Let’s go back to our pigeons for a moment.
What does this process look like to the system?
vs.
And here’s what it looks like to the user.
GOOD NOT SO GOOD
But can the system tell the difference?
User experience should reflect system confidence.
vs.
http://searchengineland.com/getting-organized-paid-search-user-intent-the-search-funnel-116312
Derived from [Jansen et al, 2007].
Searches reflect a variety of information needs.
for i in [1..n]:
    s ← w1 w2 … wi
    if Pc(s) > 0:
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]:
        for b in B[j]:
            s ← wj+1 wj+2 … wi
            if Pc(s) > 0:
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob × Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
We can segment information need from the query.
We can learn from analyzing user behavior.
And we can look at our relevance scores.
Navigational vs. Exploratory
Claudia Hauff, Query Difficulty for Digital Libraries [2009]
There are many pre- and post-retrieval signals.
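One of the simplest pre-retrieval signals from the query-difficulty literature is the average idf of the query's terms: queries made of rare, specific terms tend to be easier and more focused than queries made of common terms. A sketch with hypothetical document frequencies (the corpus size and `df` table are assumptions):

```python
import math

def avg_idf(query_terms, df, n_docs):
    """A simple pre-retrieval difficulty signal: mean idf of query terms.
    High values suggest a specific, likely easier query; low values a
    broad, likely harder one."""
    idfs = [math.log(n_docs / df[t]) for t in query_terms if df.get(t)]
    return sum(idfs) / len(idfs) if idfs else 0.0

# Hypothetical document frequencies in a 1M-document corpus
df = {"the": 900_000, "jaguar": 1_200, "xk120": 40, "car": 150_000}
print(avg_idf(["jaguar", "xk120"], df, 1_000_000))  # specific: high avg idf
print(avg_idf(["the", "car"], df, 1_000_000))       # broad: low avg idf
```

Post-retrieval signals (score distributions, result-set clarity) are stronger predictors but require running the query first; a signal like this is available before retrieval, in time to adapt the experience.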
Take-away for search engine developers:
Queries vary in difficulty. Recognize and adapt.
Review
1. Lessons from Library Science
• Act like a librarian. Communicate with users.
2. Adventures with Information Extraction
• Entity detection is crucial. And isn't that hard.
3. A Moment of Clarity
• Queries vary in difficulty. Recognize and adapt.
Conclusion: Read the textbooks.
But treat search as a communication problem.
WE’RE HIRING! http://data.linkedin.com/search
Contact me: [email protected]
http://linkedin.com/in/dtunkelang