Experimental evaluation in visual information retrieval
Paul Clough
Information School, University of Sheffield (UK)
Second Spanish Conference on Information Retrieval (CERI 2012) Valencia 18-19 June 2012
Areas of research
• Text re-use and plagiarism detection
• Multilingual information access
• Geographical Information Retrieval (GIR)
• Multimedia retrieval (images)
• Evaluation of IR systems
• User interfaces and interaction
• Construction of corpora and evaluation resources
http://ir.shef.ac.uk/cloughie/
Contents
• Part 1 – Evaluating (V)IR systems
  – Visual IR systems
  – The evaluation landscape
• Part 2 – ImageCLEF for VIR evaluation
  – Overview of ImageCLEF
  – Example tasks
  – Main findings and lessons learned
• Part 3 – Addressing some issues in IR evaluation
  – Crowdsourcing for gathering relevance assessments
  – System performance measures and user satisfaction
  – Evaluating beyond single query-response paradigm
Evaluating (V)IR systems
Visual information retrieval
• Visual information retrieval (VIR)
  – Users want to retrieve visual documents rather than texts
  – Take into account visual properties of the data
• Example use cases for VIR systems
  – Researcher searching digital archives
  – Clinicians searching for medical images (e.g. x-rays)
  – Illustrator looking for example photograph
  – Organisations checking for trademark infringements
  – Professionals accessing science databases (e.g. medicine, astronomy, geography)
– Domestic users browsing their personal photo collections
Retrieval methods
• Description-based
  – Using abstracted features assigned to the image, e.g. metadata, captions, keywords, associated text
  – Often assigned manually although can be automatic (e.g. object recognition and classification)
• Content-based (CBIR)
  – Using primitive features based on pixels which form the visual content of the image, e.g. colour, shapes, textures
  – Extracted automatically from image
• Combinations of both approaches
  – Investigating fusion of multiple modalities was an objective of ImageCLEF
Evaluating IR systems
• Evaluation is systematic determination of merit of something f d d [ ]
Evaluating IR systems
using criteria against a set of standards [Harman, 2011]• Evaluation is important for designing and developing
effective search systems (effective, efficient and usable)• Focus of evaluation will vary
– System (i.e. with little or no user involvement)– User and user interaction with system– User‐system interaction with environment
• Traditionally been a strong focus on measuring system effectiveness in controlled lab setting– Abstraction of reality (e.g. information need to query)– Comparative testing of systems (e.g. A vs. B – which is better?)– Does not account for contextual and situational factors (user’s
background and preferences search task )background and preferences, search task…)
Evaluating IR systems
• IR systems are ultimately to be used by people, for some purpose, operating in an environment
• What makes an IR system successful?
  – Whether it retrieves 'relevant' documents
  – How quickly it returns results
  – How well it supports user interaction
  – Whether the user is satisfied with the results
  – How easily users can use the system
  – Whether the system helps users carry out tasks
  – Whether the system impacts on the wider environment
• Multiple evaluation methods and measures will be used throughout IR system development
  – Evaluate components (e.g. IR system) vs. overall system
Example: evaluating library catalogues
• Evaluation consultant for the Search25 project in the UK, which is building a federated academic library catalogue search tool
• Initial study of user needs and behaviours using an online questionnaire with users of the current system (179 responses)

How important are the following factors when using academic library catalogues? (rate on a scale of 1-5, where 1 is not important, and 5 is very important)
Evaluating IR systems
• Evaluation of retrieval systems tends to focus on either the system (algorithms) or the user
• Saracevic (1995) distinguishes six levels of evaluation for information systems that include IR systems: engineering, input, processing, output, use and user, and social
  – Focus of IR evaluation mainly on the input/processing/output levels (test collections and batch-mode lab-style evaluation); ImageCLEF evaluations focused here
  – Interactive IR (IIR) evaluation and human information behaviour address the use and user, and social, levels
Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, United States, July 09-13, 1995): SIGIR '95 (pp. 138-146). New York, ACM Press
Evaluation landscape [Kelly, 2009]
• A spectrum from lab-based evaluation (controlled variables, simulated users) to in situ 'living lab' evaluation (uncontrolled variables; real users, needs and situations), with lab studies used to predict, and in situ findings used to inform, work at the other end of the spectrum
IR test collections
• Test collections provide re-usable resources to evaluate IR systems in a controlled lab setting ('Cranfield style' test collection)
  – Collection of documents
  – Set of representative queries (topics)
  – Set of relevance judgments for each topic
  – Evaluation measures (system performance)
  – Enables comparative system evaluation
• Test collection + measures provide a simulation of a user in an operational setting (if designed carefully)
  – Do results obtained with test collections predict user task success or performance / satisfaction with results?
  – But what about beyond the query-response paradigm?
  – How do you integrate contextual and situational factors?
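The components listed above fit together in a simple evaluation loop. A minimal sketch, assuming hypothetical topic and document identifiers, of scoring one system run against a set of relevance judgments using P@10:

```python
# Minimal 'Cranfield style' evaluation loop: score one system run
# against relevance judgments (qrels) using precision at rank 10.
# Topic and document ids below are illustrative.

def precision_at_k(ranked_docs, relevant_docs, k=10):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant_docs) / k

# Relevance judgments: topic id -> set of relevant document ids
qrels = {"q1": {"d1", "d3", "d7"}, "q2": {"d2", "d4"}}

# A system run: topic id -> ranked list of retrieved document ids
run = {"q1": ["d1", "d3", "d9", "d7"], "q2": ["d8", "d2"]}

per_topic = {q: precision_at_k(run[q], qrels[q]) for q in qrels}
mean_p10 = sum(per_topic.values()) / len(per_topic)   # 0.2
```

Because the qrels are fixed, the same loop can be re-run for any number of competing systems, which is what makes the resource re-usable for comparative evaluation.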
TREC-style evaluation
• Large-scale evaluation to assess the worth of new ideas in a lab-controlled setting
  – Enables comparative evaluation between systems based on common resources and standardised methodologies (test collections)
• Many large-scale activities run over the years
  – TREC, INEX, CLEF, NTCIR, …
• Also carried out in the field of VIR
  – TRECVid, PASCAL Visual Object Classes, MediaEval, ImageCLEF…
• Provided valuable resources and infrastructure to support researchers and shown to advance the field of IR
Results
[Figure: a search engine produces results, which are then assessed by judges]
From tutorial on "Low-Cost Evaluation, Reliability and Reusability", RuSSIR 2011, Evangelos Kanoulas
Practical issues
• Gathering a collection of documents
• Generating a suitable set of queries/topics
  – How do I obtain the queries/topics?
  – How many queries/topics do I need?
• Creating the relevance assessments
  – How do I gather the assessments?
  – Who should do the assessments?
  – How many assessments should be made?
  – What are the assessors expected to do?
  – What about finding missing relevant documents?
• Selecting a suitable evaluation measure
• These decisions will affect the quality of the benchmark and impact on the accuracy/usefulness of results
Limitations of test collections
• Simulation/evaluation based on test collections
  – Individual differences between users are typically ignored in the test collection setting
  – Result presentation not part of test collection
  – Collections have grown in size but the number of test queries often remains small
  – Limited diversity in tasks (e.g. how about evaluation of navigation, resource finding … explorative search)
  – Ignores the longitudinal process of searching
• Alternatives
  – Interactive evaluation, (static) log file analysis, living labs …
ImageCLEF for VIR evaluation
ImageCLEF
• International evaluation campaign for evaluating (cross-language) image retrieval (2003 – now)
  – Part of the Cross Language Evaluation Forum (CLEF)
• Comparative evaluation based on common resources and standardised methodologies
• Main objectives of ImageCLEF
  – To develop the necessary infrastructure for the evaluation of visual information retrieval systems (e.g. resources, organised events…)
  – To investigate the effectiveness of combining textual and visual features
  – To promote the exchange of ideas towards the further advancement of the field of visual media analysis, indexing, classification and retrieval
http://www.imageclef.org/
ImageCLEF tasks
• Ad hoc retrieval (since 2003)
  – Multiple querying modalities (e.g. QBVE)
  – Fusion of retrieval methods (TBIR and CBIR)
  – Promoting diversity in results
• Object and concept recognition (since 2005)
  – Object class recognition to identify whether certain objects or concepts from a pre-defined set of classes are contained in an image
  – Image annotation to assign textual labels or descriptions to an image
  – Automatic image classification to classify images into one or many classes
• Interactive image retrieval (since 2003)
Defining suitable tasks
• Where possible tasks have been informed by operational settings (use cases) and involved expert assessors
  – Tasks for medical image retrieval based on interviews with clinicians, assessments performed by trained medical professionals, and datasets realistic of the medical domain
  – Many queries for non-medical ad hoc tasks derived from analysing query logs of search systems hosting the datasets
• Also attempted to introduce challenging and novel tasks to interest researchers
  – Retrieval and classification on 'large' image datasets
  – Promoting diversity for ad hoc search
  – From image retrieval to case-based retrieval (medical)
• But getting the balance right is hard!
ImageCLEF participation
Participation in the ImageCLEF tasks and number of participants by year (2003‐2010)
Historical images
Personal images
News photo archive
Wikipedia
Medical datasets (x-rays etc.)
ImageCLEF datasets
Datasets developed in ImageCLEF (2003‐2009)
ImageCLEF 2009
• Variety of retrieval tasks
  – Photographic retrieval
  – Medical image retrieval
  – Interactive retrieval
  – Automatic medical image annotation
  – Large-scale visual concept detection
  – Wikipedia image retrieval
  – (Robot vision task)
• Pre-CLEF workshop
  – Visual retrieval evaluation
  – Sponsored by THESEUS
• 84 groups registered
  – 62 groups submitting results
Example task: promoting diversity
• A system retrieving a spread of results for a user need, or one that retrieves results across interpretations of a query, is said to promote a diverse ranking
• Techniques to promote diversity of search results seem to be gaining wide adoption in the commercial web search sector
• However, at the time (2008 and 2009) there were almost no test collections available to evaluate different techniques in a standardised manner
• Few studies considering diversity in image retrieval
The idea of diversity
• Ranking 1: all top-10 results are relevant but come from a single sub-topic cluster (C1)
  – P@10 = 1.00
  – Cluster Recall at 10 = 1 covered sub-topic / 6 total sub-topics = 0.167
• Ranking 2: all top-10 results are relevant and cover sub-topic clusters C1-C6
  – P@10 = 1.00
  – Cluster Recall at 10 = 6 covered sub-topics / 6 total sub-topics = 1.000
Example task: promoting diversity
• Designed a task where participants had to present as many diverse results as possible in the top 10 results
  – Belgian news agency (Belga) provided a dataset consisting of 498,039 images with unstructured captions (English)
  – 50 topics provided based on manual analysis of logs from Belga (average of 3.96 clusters for each topic)
  – 44 institutions registered (19 submitted runs)
  – Evaluation measures were P@10 and CR@10 (combined using F1)
  – CR@10 is known as cluster recall at rank 10 and measures how many clusters are covered in the top n results
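The two measures and their F1 combination can be sketched as follows. This is a minimal sketch with illustrative document ids and cluster labels, assuming each relevant document is judged as belonging to exactly one sub-topic cluster:

```python
# Sketch of the diversity task's measures: P@10, cluster recall at
# rank 10 (CR@10) and their F1 combination.
# Document ids and cluster labels are illustrative.

def p_at_10(ranked, relevant):
    """Precision at rank 10."""
    return sum(1 for d in ranked[:10] if d in relevant) / 10

def cr_at_10(ranked, doc_cluster, n_clusters):
    """Fraction of the topic's sub-topic clusters covered in the top 10."""
    covered = {doc_cluster[d] for d in ranked[:10] if d in doc_cluster}
    return len(covered) / n_clusters

def f1(p, cr):
    """Harmonic mean of the two measures."""
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)

# All 10 retrieved documents are relevant but come from only one of
# the topic's 6 clusters (an undiversified ranking).
doc_cluster = {f"d{i}": "C1" for i in range(10)}
ranked = [f"d{i}" for i in range(10)]

p = p_at_10(ranked, set(doc_cluster))     # 1.0
cr = cr_at_10(ranked, doc_cluster, 6)     # 1/6, about 0.167
score = f1(p, cr)
```

The harmonic mean makes the combined score sensitive to whichever measure is lower, so an undiversified ranking is penalised even at perfect precision.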
Example topic (the slide shows two query parts side by side):

Query Part 1: <title> clinton </title>   Query Part 2: <title> obama </title>

<clusterTitle> hillary clinton </clusterTitle>
<clusterDesc> Relevant images show photographs of Hillary Clinton. Images of Hillary with other people are relevant if she is shown in the foreground. Images of her in the background are irrelevant. </clusterDesc>
<image> belga26/05859430.jpg </image> <image> belga30/06098170.jpg </image>

<clusterTitle> obama clinton </clusterTitle>
<clusterDesc> Relevant images show photographs of Obama and Clinton. Images of those two with other people are relevant if they are shown in the foreground. Images of them in the background are irrelevant. </clusterDesc>
<image> belga28/06019914.jpg </image>

<clusterTitle> bill clinton </clusterTitle>
<clusterDesc> Relevant images show photographs of Bill Clinton. Images of Bill with other people are relevant if he is shown in the foreground. Images of him in the background are irrelevant. </clusterDesc>
<image> belga44/00085275.jpg </image> <image> belga30/06107499.jpg </image>
Example task: promoting diversity

No  Group        Run Name             Query   Modality  P@10   CR@10  F1
1   XEROX-SAS    XRCEXKNND            T-CT-I  TXT-IMG   0.794  0.824  0.809
2   XEROX-SAS    XRCECLUST            T-CT-I  TXT-IMG   0.772  0.818  0.794
3   XEROX-SAS    KNND                 T-CT-I  TXT-IMG   0.800  0.727  0.762
4   INRIA        LEAR5_TI_TXTIMG      T-I     TXT-IMG   0.798  0.729  0.762
5   INRIA        LEAR1_TI_TXTIMG      T-I     TXT-IMG   0.776  0.741  0.758
6   InfoComm     LRI2R_TI_TXT         T-I     TXT       0.848  0.671  0.749
7   XEROX-SAS    XRCE1                T-CT-I  TXT-IMG   0.780  0.711  0.744
8   INRIA        LEAR2_TI_TXTIMG      T-I     TXT-IMG   0.772  0.706  0.737
9   Southampton  SOTON2_T_CT_TXT      T-CT    TXT       0.824  0.654  0.729
10  Southampton  SOTON2_T_CT_TXT_IMG  T-CT    TXT-IMG   0.746  0.710  0.727

• Cluster information is essential for providing diverse results
• A combination of T-CT-I maximises diversity
• Using mixed modality achieved the highest F1
Example task: interactive IR (iCLEF)
• Run in conjunction with the CLEF interactive task (iCLEF) from 2005 and conducted in the style of the TREC interactive task
  – Experiments typically hypothesis-driven, and interfaces studied and compared using controlled user populations under laboratory conditions
  – Participants recruit users to perform experiments and these have provided valuable insights into interactive IR
• But there are problems with interactive tasks
  – User populations typically small in size (e.g. 8 participants)
  – Cost of training users, scheduling and monitoring search sessions high
  – Factors such as the user interface and relevance criteria affect success
  – Hard to produce subsequent comparisons outside the experimental setup
Example task: interactive IR (iCLEF)
• Tried a new approach in 2008-09 with a different goal [Gonzalo et al., 2008; Gonzalo et al., 2009]
  – To harvest a large search log of users performing multilingual searches on Flickr.com in an online gaming environment
• Organisers provided a default multilingual search interface
  – Functions for registering and monitoring users
  – Monolingual and multilingual search functionality
  – Customised logging capturing user-system interaction including explicit success/failure of searches, users' profiles, and post-search questionnaires for every search
• Participants could perform two tasks
  – Generation and analysis of search logs
  – Conduct their own lab-based interactive IR experiments
Example task: interactive IR (iCLEF)

Gonzalo, J., Clough, P. and Karlgren, J. (2009), Overview of iCLEF2008: Search Log Analysis for Multilingual Image Retrieval, In Proceedings of 9th Workshop of the Cross-Language Evaluation Forum (CLEF'08), September 17-19 2008, LNCS 5706, pp. 227-235.
Example task: interactive IR (iCLEF)
• Total of 2 million lines of log data generated in 2008-09
  – 435 users contributed to the logs and generated 6,182 valid search sessions

                                 2008       2009
Subjects/users                   305        130
Log lines                        1,483,806  617,947
Target images                    103        132
Valid search sessions            5,101      2,410
Successful sessions              4,033      2,149
Unsuccessful sessions            1,068      261
Hints asked                      11,044     5,805
Queries in monolingual mode      37,125     13,037
Queries in multilingual mode     36,504     17,872
Manually promoted translations   584        725
Manually penalised translations  215        353
Image descriptions inspected     418        100

Download and use the data: http://nlp.uned.es/iCLEF/
Example task: interactive IR (iCLEF)
• Logs provide a rich source of information for studying multilingual search behaviour
  – Investigating the effects of language skills on search behaviour
  – Discovering actions leading to an abort
  – Observing the switching behaviour of users within a search task
• A community and game-like approach is perhaps one way to generate resources to help analyse user-system interactions and searching behaviours
  – But there are limitations to this approach: the logs reflect only a single search task (known-item retrieval) using a pre-defined search interface
• In future, logs could be used to record behaviour for a specific version of the user interface, with systematic modifications to compare various search assistance functionalities
ImageCLEF: organisational challenges
• Obtaining funding (e.g. for relevance assessments, invited speakers…)
• Obtaining access rights to image datasets
• Motivating participation across multiple domains
• Motivating submission of results (<50% of groups who register actually submit)
• Difficult to get interest from commercial organisations to inform operational settings
• Creating realistic tasks and user models (esp. in TREC-style evaluation event)
• Efficiently creating ground truths (esp. for medical tasks)
Müller, H., Clough, P., Deselaers, T. and Caputo, B. (Eds)(2010) ImageCLEF ‐ Experimental Evaluation of Visual Information Retrieval, Springer: Heidelberg, Germany, ISBN 978‐3‐642‐15180, 495 pages.
Some of the main findings
• Consistent study over the years of combinations of image and textual information
  – 62% of papers submitted to ImageCLEF 2003-2009 proceedings used combinations of CBIR and TBIR
  – Of those using combinations, 60% used an approach based on combining multiple result lists rather than using multi-modal indexing
  – Consistent improvements (overall) using CBIR+TBIR, but query dependent
• Multilingual search just as effective as monolingual
• For certain domains (e.g. medicine) use of known resources (e.g. UMLS) helps with indexing and query expansion
• Doing interactive tasks in a TREC-style setting is still hard!
Contributions of ImageCLEF – impact [Tsikrika et al., 2011]
• Around 70% of citations are from papers not in CLEF proceedings
• 8.62 cites per paper on average
Addressing some issues in IR evaluation
Addressing issues in IR evaluation (when using test collections)
• Can we gather relevance assessments efficiently and effectively?
  – Often causes a bottleneck in evaluation and can be very resource intensive
• Does system performance translate to user satisfaction and success?
  – If not then evaluating with test collections is limited
• Can we adapt test collections to deal with multi-query sessions?
  – This would reflect more realistic searching behaviours
Generating relevance assessments
• Relevance assessment is time-consuming and causes a bottleneck in IR evaluation
  – Often requires the input of domain experts
  – Pooling is commonly used to form sets of documents for assessors to judge
  – Coverage of pools (and depth)
• More efficient approaches to judge relevance?
  – Move-to-Front (MTF) pooling
  – Interactive Search and Judge
  – Use of sampling techniques
  – Use of implicit judgments (e.g. from log data)
  – Crowdsourcing
Crowdsourcing
• Crowdsourcing is the act of taking a job traditionally performed by a designated person and outsourcing it to an undefined, generally large, group of people in the form of an open call
• Amazon Mechanical Turk (AMT) is an example crowdsourcing platform
  – Requester creates Human Intelligence Tasks (HITs)
  – Workers choose to complete HITs
  – Requesters assess results and pay workers
  – Currently > 200,000 workers from many countries
• Crowdsourcing shown to be feasible for relevance assessment [Alonso & Mizzaro, 2009; Kazai, 2011; Carvalho et al., 2011]
• However, previous studies also showed that domain expertise can have an effect on judgments [Bailey et al., 2008; Kinney et al., 2008]
Crowdsourcing study at the UK National Archives [Clough et al., 2012]
• Designed a crowdsourcing experiment using AMT to gather relevance assessments to measure the effectiveness of two competing search engines at the UK government National Archives
  – Selected the AMT route as there was limited access to search log data
• Compared assessments from AMT with those from a domain expert and measured the impact on effectiveness scores and rankings of Systems A and B
  – 48 queries selected by domain expert
  – Queries issued to Systems A and B and the 10 highest results retrieved/judged
  – Effectiveness assessed at rank 10 (P@10 and DCG@10)
  – HIT consisted of being shown the query, a description of query intent, 10 retrieved documents (from system A or B) and answering questions
  – Gathered 10 judgments per query-system (960 HITs) over 2 weeks
  – 73 workers produced 924 HITs (after noise removed)
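A minimal sketch of the DCG@10 measure used in the study, assuming the common log-discounted textbook formulation; the graded relevance values below are illustrative, not the study's actual gain settings:

```python
import math

# DCG@10 sketch using the common formulation
#   DCG@k = sum over ranks i = 1..k of rel_i / log2(i + 1)
# The graded relevance judgments below are illustrative.

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k graded judgments."""
    # enumerate from rank 1, so each gain is discounted by log2(rank + 1)
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(gains[:k], start=1))

# Graded judgments for the top 10 results of one query (0 = not relevant)
gains = [2, 1, 0, 1, 0, 0, 2, 0, 0, 1]
score = dcg_at_k(gains)
```

Unlike P@10, the log discount weights relevant documents near the top of the ranking more heavily, and graded judgments distinguish highly relevant from marginally relevant results.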
Crowdsourcing study at the UK National Archives

Questions: Q1 – difficulty (1 = very difficult; 5 = very easy); Q2 – familiarity (1 = very unfamiliar; 5 = very familiar); Q3 – confidence (1 = not at all confident; 5 = very confident); Q4 – satisfaction (1 = very unsatisfied; 5 = very satisfied)

                       Queries        Q1    Q2    Q3    Q4
Expert                 All            4.36  4.34  4.25  3.25
                       Informational  4.10  4.13  4.00  3.04
                       Navigational   4.63  4.56  4.50  3.46
                       System A       4.54  4.44  4.44  3.90
                       System B       4.19  4.25  4.06  2.60
Crowdsourced workers   All            3.47  3.54  4.12  4.13
                       Informational  3.48  3.43  4.04  4.05
                       Navigational   3.47  3.65  4.20  4.21
                       System A       3.42  3.57  4.18  4.18
                       System B       3.52  3.51  4.05  4.08

Questionnaire results for expert and crowdsourced worker responses
          Queries          DCG      P@10
System A  All (N=48)       0.492**  0.285*
          Informational    0.323    -0.029
          Navigational     0.485*   0.467*
System B  All (N=48)       0.601**  0.595**
          Informational    0.563**  0.523**
          Navigational     0.772**  0.786**

• Absolute scores between expert and workers differ but the ranking of Systems A and B remains stable (A judged the better system).
• The judgments between the crowdsourced workers and the expert are more interchangeable for system B than A, despite the resulting differences in absolute scores.
• Navigational queries correlate better than informational. The lower correlation between expert and crowdsourced workers for system A, particularly for informational queries, suggests the results of a higher quality search engine are more difficult to assess using crowdsourcing.
[Scatter plots: expert DCG score vs. MTurker DCG score per query, split by informational and navigational queries, for System A (the better system) and System B]
AMT cost: $43 for 45 hrs (73 assessors) TNA cost: $106 for 3 hrs 5 mins
Usefulness of using test collections in IR development
• Effectiveness of IR systems typically measured based on the number of "relevant" items found (Precision, Recall, DCG...)
  – Test collection and measure predict user behaviour: if system A scores higher than B on a test collection we assume users will prefer system A over B in an operational setting
• But
  – Several past studies have shown that a high increase in system effectiveness did not have detectable gains for the end user in practice (i.e. not correlated with user satisfaction / success)
  – The real issue in IR system design is not whether P/R goes up, but rather whether it helps users perform search tasks more effectively
Usefulness of using test collections in IR d l [S d l 2010]
• Experiment conducted to examine relation of system
development [Sanderson et al., 2010]
p yeffectiveness with user preference on large scale
• Study involved 296 Amazon Mechanical Turk (AMT) users working with 30 topics comparing user preferences across 19working with 30 topics comparing user preferences across 19 runs submitted to TREC 2009 Web track
• Sampled range of runs with large and small relative differences in evaluation measures
• Lists of results randomly shown to AMT users side‐by‐side and asked to make a preference judgment for given search topicp j g g p
• Total cost of experiment < $60
Usefulness of using test collections in IR development [Sanderson et al., 2010]
Usefulness of using test collections in IR development [Sanderson et al., 2010]
• Found clear evidence that effectiveness measured on a test collection predicted user preferences for one IR system over another
– Strength of prediction varied by search type (informational / navigational)
• When comparing measures, it was found that P@10 poorly modelled user preferences (ERR and nDCG were best)
• User preferences between pairs of systems where one had failed to return any relevant item were significantly stronger than for rankings with at least one relevant document
– Measures need adjusting to account for this
• Still preliminary work that requires further validation
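Since nDCG was among the measures that best modelled user preferences, a minimal sketch of nDCG@k over graded relevance judgments may be useful (standard textbook formula; the example gains are made up):

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain over the top-k graded gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    """DCG normalised by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Graded gains for a ranked list (0 = not relevant, 2 = highly relevant)
print(round(ndcg([2, 0, 1, 0], k=10), 3))
```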
Adapting test collections to deal with sessions – TREC Session Track
• TREC Session Track started in 2010 with the intention of providing test collections for studying IR over sessions rather than one‐shot queries
• 2011’s goal was to provide the best possible results for the mth query in a session, given prior session data
• Session data consisted of
– Current query qm
– Set of past queries in the session q1, q2, …, qm‐1
– Ranked list of URLs for each past query
– Set of clicked URLs/snippets and the time spent by the user reading them
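The session record above can be sketched as a small data structure; this is an illustrative shape (field names are my assumptions, not the track's official schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Interaction:
    query: str                  # a past query q1 .. q(m-1)
    results: List[str]          # ranked list of URLs returned for it
    clicked: List[str] = field(default_factory=list)   # URLs the user clicked
    dwell_seconds: List[float] = field(default_factory=list)  # time per click

@dataclass
class Session:
    current_query: str          # qm, the query the system must answer
    history: List[Interaction] = field(default_factory=list)

s = Session(
    current_query="low carb diet recipes",
    history=[Interaction(query="low carb diet",
                         results=["url1", "url2"],
                         clicked=["url1"],
                         dwell_seconds=[42.0])],
)
print(len(s.history))  # 1 prior query in the session
```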
http://trec.nist.gov/pubs/trec20/papers/SESSION.OVERVIEW.2011.pdf
Adapting test collections to deal with sessions – TREC Session Track
• Participants ran IR systems over the current query under four conditions, considered separately
– RL1: ignoring session data prior to the query
– RL2: considering only prior queries
– RL3: considering prior queries and search result URLs
– RL4: considering all data, including the items clicked on by users and the time spent viewing items
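The four conditions form a nested hierarchy of available evidence, which can be made concrete with a small helper (a hypothetical sketch, not the track's code; the session layout is assumed):

```python
def usable_context(session, condition):
    """Return the parts of the session a run may consult under RL1-RL4."""
    ctx = {"current_query": session["current_query"]}
    if condition >= 2:  # RL2: prior queries become visible
        ctx["past_queries"] = [h["query"] for h in session["history"]]
    if condition >= 3:  # RL3: plus ranked result URLs
        ctx["past_results"] = [h["results"] for h in session["history"]]
    if condition >= 4:  # RL4: plus clicks and dwell times
        ctx["clicks"] = [h["clicked"] for h in session["history"]]
        ctx["dwell"] = [h["dwell"] for h in session["history"]]
    return ctx

session = {
    "current_query": "qm",
    "history": [{"query": "q1", "results": ["u1", "u2"],
                 "clicked": ["u1"], "dwell": [30.0]}],
}
print(sorted(usable_context(session, 1)))  # ['current_query']
print(sorted(usable_context(session, 4)))  # all fields available
```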
• Provided a test collection to participants
– Used ClueWeb09 (Category B – 50M documents)
– 76 sessions for 62 topics created (re‐used from the TREC 2007 QA and 2009 Million Query tracks as they have sub‐topics)
– Custom‐built IR system developed (based on Yahoo! BOSS) and used to generate session data
Adapting test collections to deal with sessions – TREC Session Track
• Judgments created by NIST assessors
– For each topic a depth‐10 pool was formed from the ranked results for past queries q1…qm‐1 produced by Yahoo! BOSS and the top 10 documents from submitted runs on the current query qm
– Documents judged with respect to the general topic and all sub‐topics
• Relevance judgments
– ‐2 for spam, 0 for not relevant, 1 for relevant, 2 for highly relevant, 3 for topics that are navigational in nature where the judged page is “key” to satisfying the need
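When feeding this scale into graded measures, the negative spam grade needs handling; clipping it to zero gain is one common convention (an assumption here, not something the slides specify):

```python
# Judgment labels from the track's scale, mapped to their grades
GRADES = {
    "spam": -2,
    "not_relevant": 0,
    "relevant": 1,
    "highly_relevant": 2,
    "key_navigational": 3,
}

def gain(label):
    """Gain used when computing graded measures such as nDCG:
    negative (spam) grades are clipped so they contribute no gain."""
    return max(GRADES[label], 0)

print([gain(l) for l in ["spam", "relevant", "key_navigational"]])  # [0, 1, 3]
```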
Adapting test collections to deal with sessions – TREC Session Track
• Results from TREC 2011 indicate it is possible for systems to use interaction data to improve results over a baseline using no interaction data at all
– Open questions include the use of sub‐topic judgments and how to deal with duplicates
RL1 ‐> RL4 (all sub‐topics)
Challenges for IR evaluation
• Tague‐Sutcliffe [1996] highlights six issues with IR evaluation
– Should IR experiments involve real users with real information needs?
– Must IR evaluation involve actual retrieval processes?
– What kind of aggregation is appropriate in evaluating different IR systems?
– What can analysis, as opposed to the experimental or qualitative collection of data, tell us about IR systems?
– How can interactive IR systems be evaluated?
– How generalisable are the results of IR systems?
Conclusions
• Evaluating search is very important in both academic and commercial contexts
• Evaluation is often performed using test collections, which provide valuable insights into IR algorithms
– But findings based on test collections need to be validated with users and in realistic settings
– System evaluation is part of wider evaluation activities
• ImageCLEF focused on system‐oriented evaluation and inherits its limitations
– But it created a variety of realistic tasks and studied user interaction
• Future work will consider evaluating wider IR applications (where search is one component) and varying search strategies (e.g. browsing) using controlled lab‐based experiments
Evaluating Information Access Systems (ELIAS)
• ELIAS is an ESF Research Networking Programme launched in 2011 for a duration of 5 years (http://elias‐network.eu/)
• Studies living laboratories for the evaluation of information access in the large
• Horizontal dimension
– Domains and application areas
• Vertical dimension
– Fundamental questions, methodological and user simulation issues to be addressed
• Money available to support students/researchers doing evaluation research
Very useful source for test collections
Sanderson, M. “Test Collection Evaluation of Ad‐hoc Retrieval Systems”, Foundations and Trends® in Information Retrieval, 2010
69 pages with 276 articles reviewed
Created with Wordle: http://www.wordle.net
References
Carvalho, V. R., Lease, M., & Yilmaz, E. (2011) Crowdsourcing for search evaluation, ACM SIGIR Forum, 44, 17–22.
Alonso, O., & Mizzaro, S. (2009) Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment, In Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation, 15–16.
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P., & Yilmaz, E. (2008). Relevance assessment: are judges exchangeable and does it matter. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 667–674.
Clough, P., Gonzalo, J., Karlgren, J., Barker, E., Artiles, J. and Peinado, V. (2008), Large‐Scale Interactive Evaluation of Multilingual Information Access Systems ‐ the iCLEF Flickr Challenge, In Proceedings of Workshop on novel methodologies for evaluation in information retrieval, ECIR 2008, 33‐38.
Clough, P., Sanderson, M., Tang, J., Gollins, T. and Warner, A. (2012) Examining the limits of crowdsourcing for relevance assessment, IEEE Internet Computing, 28 Jun. 2012. IEEE Computer Society Digital Library. IEEE Computer Society (doi.ieeecomputersociety.org/10.1109/MIC.2012.95)
References
Gonzalo, J., Clough, P. and Karlgren, J. (2009), Overview of iCLEF2008: Search Log Analysis for Multilingual Image Retrieval, In Proceedings of 9th Workshop of the Cross‐Language Evaluation Forum (CLEF'08), September 17‐19 2008, LNCS 5706, 227‐235.
Harman, D. (2011). Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval, and Services, 3(2), 1–119.
Kazai, G. (2011). In search of quality in crowdsourcing for search engine evaluation. Advances in Information Retrieval, 165–176.
Kelly, D. (2009) Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval.
Kinney, K. A., Huffman, S. B., & Zhai, J. (2008). How evaluator domain expertise affects search result relevance judgments. Proceeding of the 17th ACM conference on Information and knowledge management (pp. 591–598). ACM.
Müller, H., Clough, P., Deselaers, T. and Caputo, B. (Eds)(2010) ImageCLEF ‐ Experimental Evaluation of Visual Information Retrieval, Springer: Heidelberg, Germany, ISBN 978‐3‐642‐15180
References
Tsikrika, T., Seco de Herrera, A.G., & Müller, H. (2011) Assessing the scholarly impact of ImageCLEF. In Proceedings of the Second international conference on Multilingual and multimodal information access evaluation (CLEF'11), Pamela Forner, Julio Gonzalo, Jaana Kekäläinen, Mounia Lalmas, and Maarten de Rijke (Eds.). Springer‐Verlag, Berlin, Heidelberg, 95‐106.
Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, United States, July 09‐13, 1995): SIGIR '95 (pp. 138‐146). New York, ACM Press.
Sanderson, M., Paramita, M., Clough, P. and Kanoulas, E. (2010) Do user preferences and evaluation measures line up?, In Proceedings of the 33rd Annual ACM SIGIR Conference, Geneva, Switzerland, pp. 555‐562.
Sanderson, M. (2010) Test Collection Evaluation of Ad‐hoc Retrieval Systems, Foundations and Trends in Information Retrieval, 4(2010), 247‐375.
Tague‐Sutcliffe, J.M. (1996) Some perspectives on the evaluation of information retrieval systems, Journal of the American Society for Information Science, 47(1), 1‐3.