Improving Search Results Quality by Customizing Summary Lengths. Michael Kaisser, Marti Hearst and John B. Lowe. University of Edinburgh, UC Berkeley, Powerset, Inc. ACL-08: HLT

Improving Search Results Quality by Customizing Summary Lengths

Michael Kaisser (University of Edinburgh), Marti Hearst (UC Berkeley) and John B. Lowe (Powerset, Inc.)

ACL-08: HLT

Talk Outline

• How best to display search results?
• Experiment 1: Is there a correlation between response type and response length?
• Experiment 2: Can humans predict the best response length?
• Summary and Outlook

Motivation

• Web search result listings today are largely standardized; they display a document’s surrogate (Marchionini et al., 2008)
• Typically: one header line, two lines of text fragments, one line for the URL (Source: Yahoo!)
• But: Is this the best way to present search results?
• Especially: Is this the optimal length for every query?

Experiment 1 – Research Question

Do different types of queries require responses of different lengths?

(And if so, is the preferred response length dependent on the expected semantic response type?)

Experiment 1 – Setup

Data used: 12,790 queries from Powerset’s query database

• Contains search engines’ query logs and hand-crafted queries
• Disproportionally large number of natural language queries

Experiment 1 – Setup

Disproportionally large number of natural language queries.

Examples:

• “date of next US election”
• Hip Hop
• A synonym for material
• highest volcano
• What problems do federal regulations cause?
• I want to make my own candles
• industrial music

Excursus – Mechanical Turk

• Amazon Web Services API for computers to integrate "artificial artificial intelligence"
• Requesters can upload Human Intelligence Tasks (HITs)
• Workers work on these HITs and are paid small sums of money
• Examples:
  • Can you see a person in the photo?
  • Is the document relevant to a query?
  • Is the review of this product positive or negative?

Mechanical Turk can also be seen as a platform for online experiments.

Experiment 1

Turkers are asked to classify queries by:

• Expected response type

• Best response length

Each query is classified by three different subjects.

Experiment 1 – Results

Distribution of length categories differs across individual expected response categories.

Some results are intuitive:

• Queries for numbers want short results
• Advice queries want longer results

Some results are more surprising:

• Different length distributions for Person vs. Organization
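One way to inspect such differences is to cross-tabulate the annotations and compare the share of each length category within each expected response type. The sketch below is illustrative only: the function name, the label strings and the handful of annotations are made up for the example and are not data from the experiment.

```python
from collections import Counter, defaultdict


def length_distribution(annotations):
    """Return {response_type: {length: share}} from (response_type, length) pairs."""
    counts = defaultdict(Counter)
    for response_type, length in annotations:
        counts[response_type][length] += 1
    return {
        rtype: {length: n / sum(c.values()) for length, n in c.items()}
        for rtype, c in counts.items()
    }


# Hypothetical annotations (NOT taken from the study).
annotations = [
    ("Number", "phrase"), ("Number", "phrase"),
    ("Number", "sentence"), ("Number", "phrase"),
    ("Advice", "paragraph"), ("Advice", "article"),
    ("Advice", "article"), ("Advice", "paragraph"),
]

dist = length_distribution(annotations)
for rtype, shares in dist.items():
    print(rtype, shares)
```

Comparing the rows of such a table side by side is what makes a difference like Person vs. Organization visible.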

Experiment 2 – Research Question

Can human judges correctly predict the preferred result length?

Experiment 2 – Setup

• Experiment 1 produced 1099 high-confidence queries (where all three turkers agreed on semantic category and length)
• For 170 of these, turkers manually created snippets from Wikipedia of different lengths:
  • Phrase
  • Sentence
  • Paragraph
  • Section
  • Article (in this case a link to the article was displayed)
• Note: Categories differ slightly from first experiment
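The high-confidence selection described above (all three turkers agree on both semantic category and length) can be sketched as a simple unanimity filter. This is a minimal sketch under assumptions: the function name and data layout are hypothetical, and the labels attached to the two example queries are invented for illustration.

```python
def high_confidence(labels_by_query):
    """Keep queries whose three annotators gave the same (category, length) label."""
    kept = {}
    for query, labels in labels_by_query.items():
        # Unanimous means exactly three labels that are all identical.
        if len(labels) == 3 and len(set(labels)) == 1:
            kept[query] = labels[0]
    return kept


# Hypothetical labels (NOT taken from the study).
labels_by_query = {
    "highest volcano": [("Location", "phrase")] * 3,
    "I want to make my own candles": [
        ("Advice", "article"),
        ("Advice", "article"),
        ("Advice", "paragraph"),  # one annotator disagrees on length
    ],
}

selected = high_confidence(labels_by_query)
print(selected)
```

Requiring agreement on the full pair, rather than on category and length separately, is the stricter reading of the slide's wording.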


Experiment 2 – Setup

Displayed:

• Instructions

• Query

• One response from one length category

• Rating scale

Each HIT was shown to ten turkers.

Experiment 2 – Setup

Instructions:

Below you see a search engine query and a possible response. We would like you to give us your opinion about the response. We are especially interested in the length of the response. Is it suitable for the query? Is there too much or not enough information? Please rate the response on a scale from 0 (very bad response) to 10 (very good response).

Experiment 2 – Significance

Predicted length   Slope    Std. Error   p-value
Phrase             -0.850   0.044        <0.0001
Sentence           -0.550   0.050        <0.0001
Paragraph           0.328   0.049        <0.0001
Article             0.856   0.053        <0.0001

Significance results of unweighted linear regression on the data for the second experiment, which was separated into four groups based on the predicted preferred length.
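As a rough illustration of what a slope and its standard error mean here, the sketch below fits an unweighted least-squares line of rating against displayed length for one group of queries. It rests on assumptions: the slides do not say how length was encoded, so the ordinal coding 0 (phrase) to 4 (article) is a guess, the ratings are invented, and the p-value (which needs a t-distribution) is not computed.

```python
import math


def ols_slope(xs, ys):
    """Unweighted simple linear regression; returns (slope, standard error of slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the usual n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, std_err


# Hypothetical ratings for queries predicted to want a *phrase*:
# displayed length coded 0 (phrase) .. 4 (article), rating on the 0-10 scale.
lengths = [0, 1, 2, 3, 4]
ratings = [9, 7, 6, 4, 2]

slope, std_err = ols_slope(lengths, ratings)
print(f"slope={slope:.3f} std_err={std_err:.3f}")
```

A negative slope for a group means ratings fall as the displayed response gets longer, which is the pattern the table reports for the Phrase and Sentence groups.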

Experiment 2 – Details

• 146 queries × 5 length categories per query × 10 judgments per HIT = 7,300 judgments
• 124 judges
  • 16 judges did more than 146 HITs
  • 2 of these 16 were excluded (scammers)
• $0.01 per judgment
  • $73 paid to judges, plus $73 Amazon fees
  • $146 for Experiment 2 (excluding snippet generation)
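The figures on this slide can be checked with a few lines of arithmetic. The sketch below only restates the numbers given above; the constant names are made up, and the Amazon fee is taken as stated on the slide rather than derived from any fee schedule.

```python
QUERIES = 146
LENGTH_CATEGORIES = 5
JUDGMENTS_PER_HIT = 10
PRICE_CENTS_PER_JUDGMENT = 1   # $0.01, kept in cents to avoid float rounding
AMAZON_FEES_DOLLARS = 73       # as stated on the slide, not computed here

judgments = QUERIES * LENGTH_CATEGORIES * JUDGMENTS_PER_HIT
payment_cents = judgments * PRICE_CENTS_PER_JUDGMENT
total_dollars = payment_cents // 100 + AMAZON_FEES_DOLLARS

print(judgments, payment_cents / 100, total_dollars)
```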

Experiment 2 – Results

Results:

• Human judges can predict the preferred result lengths (at least for a subset of especially clear queries)
• Standard results listings are often too short (and sometimes too long)

Outlook

Can queries be automatically classified according to their predicted result length?

Initial Experiment:

• Unigram word counts
• 805 training queries, 286 test queries
• Three length bins (long, short, other)
• Weka NaiveBayesMultinomial

Initial Result: 78% of queries correctly classified
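To make the idea concrete, here is a small from-scratch multinomial naive Bayes classifier over unigram counts. It is a stand-in written for illustration, not Weka's NaiveBayesMultinomial and not the authors' setup: the class name, the whitespace tokenizer, the add-one smoothing and the eight training queries with their length bins are all assumptions.

```python
import math
from collections import Counter, defaultdict


class UnigramNaiveBayes:
    """Multinomial naive Bayes over unigram counts with add-one smoothing."""

    def fit(self, queries, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for query, label in zip(queries, labels):
            self.word_counts[label].update(query.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, query):
        # Words never seen in training are ignored.
        tokens = [t for t in query.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data (NOT the 805 queries from the study).
train = [
    ("how do i make candles", "long"),
    ("how to train a puppy", "long"),
    ("what problems do federal regulations cause", "long"),
    ("date of next us election", "short"),
    ("height of highest volcano", "short"),
    ("population of france", "short"),
    ("hip hop", "other"),
    ("industrial music", "other"),
]
model = UnigramNaiveBayes().fit([q for q, _ in train], [b for _, b in train])

for query in ["how do i make soap", "date of birth"]:
    print(query, "->", model.predict(query))
```

With so little data the model mostly keys on a few cue words such as "how" or "of"; a real experiment would need the full training set and a held-out test set.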

Thank you!

MT Demographics

• Age
• Gender
• Education
• Income
• Purpose

Survey, data and graphs from Panos Ipeirotis’ blog: http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html

Excursus – Mechanical Turk

Example HIT (not ours):