Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones
Post on 14-Jan-2016
Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval
Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones
School of Computing, Dublin City University, Ireland
20 May 2010
LREC 2010
Outline
CNGL
Objective
Data collection preparation and overview
IR test collection design
Baseline Experiments
Summary
CNGL
Centre for Next Generation Localisation (CNGL)
4 Universities: DCU, TCD, UCD, and UL
Team: 120 PhD students, PostDocs, and PIs
Supported by Science Foundation Ireland (SFI)
9 Industrial Partners: IBM, Microsoft, Symantec, …
Objective: Automation of the localisation process
Technologies: MT, AH, IR, NLP, Speech, and Dev.
Objective
Create a collection of data that is:
1. Suitable for IR tasks
2. Suitable for other research fields (AH, NLP)
3. Large enough to produce conclusive results
4. Associated with defined evaluation strategies
Prepare the collection from freely available data: YouTube
Domain specific (Basketball)
Build standard IR test collection (document set + topics set + relevance assessment)
YouTube Videos Features
Document:
Video URL
Video title
Tags
Posting user
Posting date
Description
Category
Number of views
Length
Responded videos
Related videos
Comments
Number of ratings
Number of times favorited
Methodology for Crawling Data
50 NBA related queries used to search YouTube
First 700 results per query crawled with related videos
Crawled pages parsed and metadata extracted.
Extracted data represented in XML format
Non-sport category results filtered out
Queries used:
NBA, NBA Highlights, NBA All Stars, NBA fights
Top ranked 15 NBA players in 2008 + Jordan + Shaq
29 NBA teams
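The filtering and XML steps of the crawl can be sketched as below. This is only an illustration under stated assumptions: the record fields, the element names, the `Sports` category label, and the example URLs are hypothetical and are not taken from the collection's actual schema.

```python
import xml.etree.ElementTree as ET


def is_sports(video):
    """Keep a record only if its category is Sports (label is an assumption)."""
    return video.get("category", "").strip().lower() == "sports"


def video_to_xml(video):
    """Serialise one extracted metadata record as an XML string.

    Element names are hypothetical, chosen for illustration only.
    """
    doc = ET.Element("video")
    for field in ("url", "title", "description", "category"):
        ET.SubElement(doc, field).text = video.get(field, "")
    tags = ET.SubElement(doc, "tags")
    for tag in video.get("tags", []):
        ET.SubElement(tags, "tag").text = tag
    return ET.tostring(doc, encoding="unicode")


# Toy records standing in for parsed video pages.
crawled = [
    {"url": "http://example.com/v1", "title": "Kobe Bryant highlights",
     "description": "Top plays", "category": "Sports",
     "tags": ["NBA", "Kobe Bryant"]},
    {"url": "http://example.com/v2", "title": "NBA theme song",
     "description": "Music clip", "category": "Music", "tags": ["NBA"]},
]

kept = [v for v in crawled if is_sports(v)]       # drop non-sport results
xml_docs = [video_to_xml(v) for v in kept]        # one XML document per video
```

Filtering before serialisation keeps out-of-domain pages from ever entering the document set.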
Data Collection Overview
Crawled video pages: 61,340 pages
Max crawled related/responded video pages: 20
Max crawled comments for a given video page: 500
Comments associated with contributing user’s ID
Crawled user profiles ≈ 250k
XML sample
Topics Creation
<title>Michael Jordan best dunks</title>
<description>Find the best dunks from Michael Jordan's NBA career. These can be collections of dunks in matches or in dunk contests he participated in.</description>
<narrative>A relevant video should contain at least one dunk by Jordan. Videos of dunks by other players are not relevant, nor are plays by Jordan other than dunks.</narrative>
40 topics (queries) created
Specific topics related to NBA
TREC topic = query (title) + description + narrative
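A topic in this three-field form can be read with a few lines of Python. This is a minimal sketch: the wrapping `<topic>` element is an assumption added so the fragment parses as one XML document, and the description and narrative texts are placeholders.

```python
import xml.etree.ElementTree as ET

# The <topic> wrapper is hypothetical; only the three field names come from the slide.
topic_xml = """<topic>
<title>Michael Jordan best dunks</title>
<description>(description text)</description>
<narrative>(narrative text)</narrative>
</topic>"""


def parse_topic(xml_text):
    """Return a dict mapping each topic field name to its stripped text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


topic = parse_topic(topic_xml)
query = topic["title"]  # the title is what gets submitted as the query
```

Only the title is submitted as the query; the description and narrative guide the assessors.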
Relevance Assessment
4 indexes created:
Title
Title + Tags
Title + Tags + Description
Title + Tags + Description + Related videos titles
5 different retrieval models used
20 different result lists (4 indexes × 5 retrieval models), each containing 60 documents
Result lists merged with random ranking
122 to 466 documents assessed per topic
1 to 125 relevant documents per topic (avg. = 23)
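The pooling step can be sketched as follows. With 20 lists of 60 documents, a pool holds at most 20 × 60 = 1,200 documents per topic, so the reported 122 to 466 assessed documents imply heavy overlap between the lists. The function name, the fixed seed, and the toy runs are assumptions for illustration, not the procedure actually used.

```python
import random


def build_pool(result_lists, depth=60, seed=0):
    """Merge the top `depth` documents of each ranked list into one
    de-duplicated pool, then shuffle it so assessors cannot see ranks."""
    pool = set()
    for ranked in result_lists:
        pool.update(ranked[:depth])
    pooled = sorted(pool)                  # fixed order before shuffling
    random.Random(seed).shuffle(pooled)    # seeded, so the order is reproducible
    return pooled


# Toy example: three runs, pooled to depth 2.
runs = [["d1", "d2", "d3"], ["d2", "d4"], ["d5", "d1"]]
pool = build_pool(runs, depth=2)
```

Shuffling hides both the source system and the original rank from the assessor.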
Baseline Experiments
Search 4 different indexes:
Title
Title + Tags
Title + Tags + Description
Title + Tags + Description + Related videos titles
Indri retrieval model used to rank results
1000 results retrieved for each search
Mean average precision (MAP) used to compare the results
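For reference, MAP over a set of topics can be computed as in the sketch below. It uses the standard definition (average precision per topic, averaged over topics); the run and relevance data are toy values, not results from this collection.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant doc IDs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank           # precision at this relevant document
    return total / len(relevant)           # unretrieved relevant docs count as 0


def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all topics in `qrels`."""
    scores = [average_precision(runs.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data: two topics.
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d9", "d7"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d7"}}
map_score = mean_average_precision(runs, qrels)
```

In the toy data, topic `t1` scores (1/1 + 2/3) / 2 = 5/6 and topic `t2` scores 1/2, giving a MAP of 2/3.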
Results
[Bar chart: MAP (y-axis, 0.00 to 0.45) for each of the four indexes: Title, Title+Tags, Title+Tags+Desc, All text fields]
Summary (new language resource)
61,340 XML docs
40 topics + rel. assess.
250,000 user profiles
Comments
Ratings
# Views
Metadata
IR test set
AH/Personalisation
Sentiment Analysis
Videos
Multimedia processing
Reranking using ML
Tags
NER
Top bigrams in “Tags” field
Kobe Bryant, NBA Basketball, Lebron James, Michael Jordan
Los Angeles, All Star
Chicago Bulls, Boston Celtics, Allen Iverson
Angeles Lakers, Slam Dunk
Basketball NBA, Dwight Howard
Vince Carter, Dwyane Wade, Kevin Garnett
Toronto Raptors, Houston Rockets
Miami Heat, O’Neal
Phoenix Suns, Detroit Pistons, Tracy Mcgrady
Yao Ming, Chris Paul
Amazing Highlights, New York, Pau Gasol
Cleveland Cavaliers, NBA Amazing
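A bigram list like this one can be produced by counting adjacent token pairs in each video's tags field. The sketch below assumes tags are stored as one whitespace-separated string per video; the function name and the toy tag strings are hypothetical.

```python
from collections import Counter


def top_bigrams(tag_fields, k=3):
    """Return the k most frequent adjacent token pairs across all tag fields."""
    counts = Counter()
    for field in tag_fields:
        tokens = field.split()
        counts.update(zip(tokens, tokens[1:]))   # adjacent pairs within one field
    return [" ".join(pair) for pair, _ in counts.most_common(k)]


# Toy tag fields, one string per video.
tag_fields = [
    "Kobe Bryant Los Angeles Lakers",
    "Kobe Bryant NBA Basketball",
    "Los Angeles Lakers Kobe Bryant",
]
top = top_bigrams(tag_fields, k=3)
```

Counting over raw tokens also explains why both "Los Angeles" and "Angeles Lakers" appear: a three-word name yields two overlapping bigrams.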
Questions & Answers
Q: Is this collection available for free?
A: No
Q: Nothing could be provided?
A: Scripts + Topics + Rel. assess. (needs updating)
Q: Any other questions?
A: …
Thank you