
Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments

SIGIR '09, July 2009

Summary
• Motivation
• Overview
• Related Work
• Methodology
• Pilot Study
• Analysis and Findings
• Conclusions

Motivation
As technology has advanced, digital files such as digital books, audio, and video have attracted growing interest and use.

These digital files present new challenges in the construction of test collections, specifically in collecting the relevance assessments used to tune system performance. This is due to:
• The length and cohesion of the digital item
• The dispersion of topics within it

Proposal => Develop a method for the collective gathering of relevance assessments that uses a social game model to encourage participant engagement.

Overview

Test collections consist of:
• A corpus of documents
• A set of search topics
• Relevance assessments collected from human judges

<doc> <docno> WSJ88046-0090 </docno><hl> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </hl><author> Janet Guyon </author><dateline> New York </dateline><text>American Telephone & Telegraph Co. introduced the first of a new....</text></doc>

<top><num> Number: 168<title> Topic: Financing AMTRAK<desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

<narr> Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss.. </top>

Document (TREC)

Topic (TREC)

Overview
Test Collection Construction (in TREC):

A set of documents and a set of topics are given to the TREC participants.

Each participant runs the topics against the documents using their retrieval system.

A ranked list of the top k documents per topic is returned to TREC.

TREC forms pools (the top k documents from each of the participants’ submissions), which are judged by the relevance assessors.

Each submission is then evaluated using the resulting relevance judgments, and the evaluation results are returned to the participants.
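
The pooling and evaluation steps above can be sketched in a few lines of Python. This is a minimal illustration, not TREC's actual tooling: the run names, document IDs, judgments, and the choice of precision at k as the evaluation measure are all assumptions made for the example.

```python
from collections import defaultdict


def build_pools(runs, k):
    """Union of the top-k documents per topic across all submitted runs.

    runs: {run_name: {topic_id: [doc_id, ...] ranked best-first}}
    returns: {topic_id: set of doc_ids to be judged by the assessors}
    """
    pools = defaultdict(set)
    for ranking_by_topic in runs.values():
        for topic_id, ranked_docs in ranking_by_topic.items():
            pools[topic_id].update(ranked_docs[:k])
    return dict(pools)


def precision_at_k(ranked_docs, relevant_docs, k):
    """Fraction of the top-k documents judged relevant.

    Documents outside the judged-relevant set count as non-relevant.
    """
    top = ranked_docs[:k]
    return sum(1 for doc_id in top if doc_id in relevant_docs) / k


# Hypothetical submissions from two participants for topic 168.
runs = {
    "run_a": {"168": ["d1", "d2", "d3"]},
    "run_b": {"168": ["d2", "d4", "d1"]},
}

pools = build_pools(runs, k=2)

# Hypothetical assessor judgments over the pool (relevant documents only).
qrels = {"168": {"d2", "d4"}}

scores = {
    name: precision_at_k(ranking["168"], qrels["168"], k=2)
    for name, ranking in runs.items()
}
```

Only pooled documents are ever judged, so the pool depth k bounds the assessors' workload while leaving deeper-ranked documents unjudged.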

Related work
Gathering relevance judgments:

Single judge – usually the topic author assesses the relevance of documents to the given topic.

Multiple judges – assessments are collected from multiple judges and are typically converted to a single score per document.

In Web search, judgments are collected from a representative sample of the user population. User logs are also often mined for indicators of user satisfaction with the retrieved documents.
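
As an illustration of converting multiple judges' assessments to a single score per document, here is a minimal Python sketch. The slides do not say which conversion rule is used; majority voting, the tie-breaking rule, and the sample labels below are assumptions chosen for the example.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label among the judges.

    Ties are broken in favour of the higher label, which is an
    arbitrary choice made for this sketch.
    """
    counts = Counter(labels)
    best = max(counts.values())
    return max(label for label, n in counts.items() if n == best)


def aggregate(judgments):
    """judgments: {doc_id: [label from each judge]} -> {doc_id: single label}"""
    return {doc_id: majority_label(labels) for doc_id, labels in judgments.items()}


# Hypothetical binary judgments (1 = relevant, 0 = not relevant).
judgments = {"d1": [1, 1, 0], "d2": [0, 0, 1], "d3": [1, 0]}
single = aggregate(judgments)
```

Collapsing to one label discards the disagreement between judges, which is the information that approaches built around multiple assessors aim to keep.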

Related work
In their approach, they extended the use of multiple assessors per topic by:
• Facilitating the review and re-assessment of relevance judgments
• Enabling communication between judges
• Providing an enriched collection of relevance labels that incorporate different user profiles and user needs

This also enables the preservation and promotion of diversity of opinions.


Methodology
The Collective Relevance Assessment (CRA) method involves three phases:
• Preparation of data and setting CRA objectives
• Design of the game
• Relevance Assessment System

Pilot Study
Two rounds: the first lasted 2 weeks, the second lasted 4 weeks

Data: INEX 2008 Track (50,000 digitized books, 17 million scanned pages, 70 TREC-style topics)

Participants: 17

Collected data:
• Highlighted document regions
• Binary relevance level per page
• Notes and comments
• Relevance degree assigned to a book
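
One way to picture the collected data is as a record per assessed book. The sketch below is hypothetical: the class names, field names, the region format, and the numeric scale for the relevance degree are assumptions, since the slides list only the kinds of data gathered.

```python
from dataclasses import dataclass, field


@dataclass
class PageAssessment:
    page: int
    relevant: bool  # binary relevance level per page
    # Highlighted regions; the (x, y, width, height) format is an assumption.
    highlights: list = field(default_factory=list)
    notes: str = ""  # free-text notes and comments


@dataclass
class BookAssessment:
    assessor_id: str
    topic_id: str
    book_id: str
    relevance_degree: int  # degree assigned to the whole book; scale is assumed
    pages: list = field(default_factory=list)

    def relevant_pages(self):
        """Page numbers the assessor marked as relevant."""
        return [p.page for p in self.pages if p.relevant]


# Hypothetical assessment of one book for one topic.
record = BookAssessment(
    assessor_id="a01",
    topic_id="t17",
    book_id="b0042",
    relevance_degree=2,
    pages=[
        PageAssessment(page=12, relevant=True, highlights=[(40, 100, 300, 60)]),
        PageAssessment(page=13, relevant=False, notes="Only a passing mention."),
        PageAssessment(page=57, relevant=True),
    ],
)
```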

Analysis and Findings
Properties of the methodology:
• Feasibility – engagement level comparable to INEX 2003
• Completeness and Exhaustiveness – 17.6% maximum completeness level
• Semantic Unit and Cohesion – relevant information forms a minor theme of the book; relevant content is dispersed
• Browsing and Relevance Decision – assessors require contextual information to make a decision
• Influence of incentive structures
• Exploring vs. Reviewing Assessment Strategies

Quality of the collected data:
• Assessor agreement – the level of agreement is higher than in TREC and INEX
• Annotations
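
For readers unfamiliar with how assessor agreement can be quantified, the following Python sketch computes two common measures over binary page-level judgments. The slides do not state which agreement measure was used; percentage agreement and Cohen's kappa are assumptions here, as are the sample judgments.

```python
def percent_agreement(a, b):
    """Share of commonly judged pages on which two assessors gave the same label.

    a, b: {page_id: 0 or 1}
    """
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("assessors share no judged pages")
    return sum(a[p] == b[p] for p in shared) / len(shared)


def cohens_kappa(a, b):
    """Chance-corrected agreement for two assessors with binary labels."""
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("assessors share no judged pages")
    n = len(shared)
    observed = sum(a[p] == b[p] for p in shared) / n
    rate_a = sum(a[p] for p in shared) / n  # share of pages a marked relevant
    rate_b = sum(b[p] for p in shared) / n  # share of pages b marked relevant
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1:
        return 1.0  # both assessors used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments by two assessors on the same four pages.
assessor_1 = {1: 1, 2: 1, 3: 0, 4: 0}
assessor_2 = {1: 1, 2: 0, 3: 0, 4: 0}

agreement = percent_agreement(assessor_1, assessor_2)
kappa = cohens_kappa(assessor_1, assessor_2)
```

Both functions compare only pages judged by both assessors, which matters when completeness is low and assessors cover different parts of a book.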

Conclusions
The CRA method successfully expanded traditional methods and introduced new concepts for gathering relevance assessments.

Encourages personalized and diverse perspectives on the topics.

Promotes the collection of rich contextual data that can assist with interpreting relevance assessments and their use for system optimization.
